













Vol. 54, No. 2
March, 1957


Psychological Bulletin 


MEASUREMENT OF REPRODUCIBILITY¹
BENJAMIN W. WHITE 
Lincoln Laboratory, Massachusetts Institute of Technology 


AND ELI SALTZ 
Air Force Personnel and Training Research Center, Chanute Air Force Base, Illinois 


Much of our knowledge of human behavior is based upon data obtained through the administration of multiple-choice tests to groups of subjects. Such instruments are used in many ways: selection, attitude measurement, ability measurement, and clinical diagnosis, to name only a few. Particularly since the publication of Guttman's model for measuring a test's reproducibility (7), there has been increasing concern over one aspect of the responses of groups of subjects to groups of items: the extent to which the patterns of subjects’ responses can be predicted from their total scores. While these considerations have been of great interest to social and clinical psychologists, they have also proved pertinent to constructors of ability tests. It is the purpose of this article (a) to examine some of the techniques which have been devised to assess a test's “reproducibility,” “homogeneity,” or internal consistency, (b) to evaluate these techniques against certain criteria, and (c) to suggest possible logical relationships of these techniques to the concept of reliability.


¹ The opinions and conclusions contained in this article are those of the authors. They are not to be construed as reflecting the views or endorsement of the Department of the Air Force.


In the ensuing discussion the word 
test will be used to describe any tech- 
nique whereby two or more subjects 
respond to two or more stimuli in 
such a way that the responses of all 
subjects to each item can be dichot- 
omized. It is assumed that every 
subject responds to every such item. 
It is further assumed that the experi- 
menter assigns a value of unity to 
all responses on one side of the dichot- 
omy and a value of zero to the rest. 
A “total score” for a subject is computed
by adding the weights assigned
to his responses thus dichotomized. 
With this system, a subject's total 
score is the number of responses he 
has made which fall into the unity- 
weighted class. 
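The scoring convention just described can be sketched in Python. The response matrix below is hypothetical and the function names are illustrative assumptions; only the rule itself (score = number of unity-weighted responses) comes from the text.

```python
# Hypothetical dichotomized responses: rows are subjects, columns are items.
# 1 = response in the unity-weighted class ("pass"), 0 = any other response.
responses = [
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]

def total_scores(matrix):
    """Total score of each subject: the count of unity-weighted responses."""
    return [sum(row) for row in matrix]

def item_pass_counts(matrix):
    """Number of subjects passing each item (the column marginals)."""
    return [sum(col) for col in zip(*matrix)]

scores = total_scores(responses)
passes = item_pass_counts(responses)
```

The same two marginals (row sums and column sums) are the only summary quantities most of the indices discussed later start from.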

Often such scores are presumed to 
yield an ordering of the subjects on 
some hypothetical linear continuum, 
ability, or trait. For some time social 
scientists have been aware that this 
process of assigning a simple order 
to people on the basis of their re- 
sponses to a number of test items is 
a legitimate representation of their 
test behavior only when their re- 
sponses possess certain characteris- 
tics. There are many ways of stating 
this, but for the purposes of this dis- 
cussion, it will be most convenient to 
use the following: a total score, com- 
puted by counting the number of 
test responses which have been classified
in one of two ways, will yield a
perfect mapping of the entire pattern 
of responses of all subjects when, and 
only when, the interitem covariances 
are maximal. 

For purposes of illustration, con- 
sider a six-item test. On such a test, 
total scores can take seven possible 
values from 0 to 6. When interitem 
covariance is maximal, there is only 
one way in which a subject can make 
any given total score. Naturally he 
can make a total score of 0 only by 
“failing” all six items, and a score of
6 only by “passing” all six items. He
can make a score of 1 only by passing
the easiest item. By “easiest” is meant
the item which was passed by more 
subjects than any other. Similarly he 
can make a score of 2 only by passing 
the two easiest items. In other words, 
given the information that the inter- 
item covariances are maximal, the 
order of difficulty of the items, and a 
subject's total score, one can tell ex- 
actly which items the subject got 
wrong and right. On such a test there 
are only seven ways in which people 
respond to the items, and each of 
these corresponds with one of the 
seven possible total scores. 
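The one-to-one correspondence between total scores and response patterns can be sketched in Python. The item indexing and the name `guttman_pattern` are illustrative assumptions rather than part of the original exposition; the count of all possible patterns on six dichotomous items is included only for contrast.

```python
from itertools import product
from math import comb

def guttman_pattern(total_score, difficulty_order):
    """Response pattern implied by a total score on a perfectly reproducible test.

    difficulty_order lists item indices from easiest to hardest; a subject
    with total score s passes exactly the s easiest items.
    """
    passed = set(difficulty_order[:total_score])
    return tuple(1 if item in passed else 0 for item in range(len(difficulty_order)))

k = 6
order = list(range(k))  # assumption: item 0 is the easiest, item 5 the hardest

# One pattern per total score when interitem covariances are maximal.
scale_patterns = [guttman_pattern(s, order) for s in range(k + 1)]

# All patterns that six dichotomous items could produce in principle.
all_patterns = list(product([0, 1], repeat=k))
patterns_scoring_two = [p for p in all_patterns if sum(p) == 2]
n_ways_two = comb(k, 2)
```

With maximal covariances only 7 of the 64 conceivable patterns occur, which is why the total score alone suffices to recover every response.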

At the other extreme, consider a six-item test whose items are independent, i.e., exhibit zero covariances. Such a test could yield $2^6$ or 64 different response patterns. There would be 15 different ways in which a person could get a total score of 2, for example. In this situation, given knowledge of the total score, the order of difficulty of the items, and the fact of zero covariance between items, one would not be able to reconstruct a subject's pattern of responses to the test, unless the total score happened to be 0 or 6. Representation of the test behavior of the subjects with the conventional total score would result in a considerable loss of information.

Various indices have been developed which will permit the tester to
ascertain the degree to which the 
total scores of a given test yield a 
complete mapping of the responses 
of all subjects to all the items (repro- 
ducibility). These indices differ not 
only in their computational formulas, 
but in their underlying assumptions, 
though all start with the same pri- 
mary data: the dichotomized re- 
sponses of a group of subjects to a 
group of test items. Four criteria are 
suggested against which each index 
may be evaluated. 

1. Does it yield a theoretical maxi- 
mum value which is the same for any 
test? 

2. Does it yield a theoretical mini- 
mum value which is the same for any 
test? 

3. Does it permit evaluation of the
null hypothesis that the obtained
reproducibility index is not significantly
different from chance?

4. Does it permit evaluation of each 
item in the test as well as of the test 
as a whole? 

The rationales for these criteria are 
reasonably straightforward. If maxi- 
mum or minimum possible values dif- 
fer from test to test, it is difficult to 
evaluate one test against another. 
For example, two tests having repro- 
ducibility quotients of .90 are dif- 
ferently evaluated when it is dis- 
covered that the minimum theoreti- 
cal reproducibility of one is .60, and 
of the other .90. If the quotient does 
not have a known sampling distri- 
bution, there is the possibility that 
the obtained quotient does not differ 
significantly from chance. And 
finally, if the items cannot be evaluated,
it is difficult to improve reproducibility
by omission or inclusion of
specific items. 

In the light of these criteria, we propose to discuss several techniques which have been devised to yield an index of reproducibility. In order to demonstrate the computations involved in each technique, we shall use the responses of ten subjects to a six-item test, illustrated in Table 1.

TABLE 1
RESPONSES OF TEN SUBJECTS TO A SIX-ITEM TEST WHERE ROWS AND COLUMNS ARE UNORDERED
[Table body not recoverable. Rows are subjects; columns are items, with item difficulty as the bottom marginal.]

In this matrix the rows represent 
subjects and the columns test items. 
The marginal entries at the bottom 
of the matrix indicate the number of 
subjects who “passed” a given item,


and the marginals in the last column 
of the matrix represent the number 
of items each subject “passed.” 


GUTTMAN’S REPRODUCIBILITY


Guttman (7) originated the term 
reproducibility. The term means es- 
sentially the degree to which one can 
reproduce a subject's entire response 
pattern from a knowledge of his total 
score and the order of difficulty of the 
items. Originally Guttman’s technique of obtaining the index of reproducibility
involved mechanical operations on a matrix of N subjects and
k test items similar to Table 1. A
device, the scalogram board, per- 
mits interchange of rows and columns 
of this matrix in a particular manner 
so that the unity entries are maximal- 
ly concentrated above the main di- 
agonal of the matrix. Such rearrange- 
ment of the response matrix in Table 
1 is shown in Table 2. 
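The rearrangement performed on the scalogram board can be approximated by sorting on the marginals. The sketch below is only one convention (highest total scores first, easiest items first) and the name `order_matrix` is hypothetical; it does not reproduce Guttman's further rules for refining the arrangement.

```python
def order_matrix(matrix):
    """Sort columns by descending pass count and rows by descending total score.

    Returns (ordered_matrix, row_order, col_order). Tied rows or columns keep
    their original relative order, which is one arbitrary choice among the
    arrangements the marginals allow.
    """
    n_items = len(matrix[0])
    col_order = sorted(range(n_items), key=lambda j: -sum(row[j] for row in matrix))
    row_order = sorted(range(len(matrix)), key=lambda i: -sum(matrix[i]))
    ordered = [[matrix[i][j] for j in col_order] for i in row_order]
    return ordered, row_order, col_order

# Hypothetical unordered responses of three subjects to three items.
unordered = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
]
ordered, row_order, col_order = order_matrix(unordered)
```

In this toy case the ordered matrix is triangular, so the unity entries fall entirely on one side of the diagonal and no errors remain.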

TABLE 2
RESPONSES OF TEN SUBJECTS TO A SIX-ITEM TEST WHERE ROWS AND COLUMNS ARE ORDERED
[Table body not recoverable. Rows are subjects with their total scores; columns are items, with item difficulty as the bottom marginal.]

TABLE 3
JACKSON'S METHOD OF COMPUTING REPRODUCIBILITY (R), MINIMUM REPRODUCIBILITY (MR), AND PLUS PERCENTAGE RATIO (PPR)
[Table body not recoverable. Columns are items, each split into + and - entries; rows are subjects, followed by # right, # wrong, Errors, $R_i$, $MR_i$, $PP_i$, and $PPR_i$.]
Note. Rights are listed under +. Wrongs are listed under -.
Total errors = 7; $R_t$ = 88%; $MR_t$ = 70%; $PP_t$ = 18%; $PPR_t$ = .61

It should be noted that if there are any ties in total score or in the number of subjects passing items, the arrangement of the matrix may not be unique. In this example the order of the columns is unique since there are no ties in the number of subjects passing items, but the order of rows is not, since there are two subjects at each total score level. In such cases further permutations of rows and columns are made until errors are minimized. The index of reproducibility is a function of the number of errors, i.e., unity entries which are below the diagonal and zero entries which are above it. This diagonal is not necessarily exactly coincident with the main diagonal, and Guttman has several rules to be followed in its determination. Since Guttman's original procedure is unwieldy, we shall in this illustration use a procedure developed by Jackson (10) for arriving at cutting points for each item. For all practical purposes, Jackson's $R_t$ quotient is identical with Guttman’s.² Jackson's method is illustrated in Table 3 above.


This is the same matrix shown in Table 2, except that the unity and zero entries under each item have been placed in separate columns. In order to draw cutting points, one simply draws a line across each column at the place where the number of zero entries above the line and the number of unit entries below the line (errors) are minimized. These cutting points are seen as descending steps in the table. In the first column there is one entry of unity which falls below the cutting line and this has been put in parentheses. If the cutting line had been drawn directly below this unity entry, the five zero entries above it would be counted as errors and put in parentheses. In this illustration there is a unique cutting point for five of the six items, i.e., a line which yields an absolute minimum number of errors. In Item 5, however, the line could be either where it is drawn, or two rows higher. Either solution yields 1 error. The lower one was chosen because it yielded an additional cutting point for the scale,³ whereas the higher cutting point would have been identical with that of Item 1.

² It should be noted that many people have suggested modifications in the calculations of Guttman's $R_t$ (3, 6, 11, 17, 18). These refinements of procedure are, by and large, identical in their logical properties with Guttman’s quotient, and were so intended by their authors. Consequently no space is given to them in this article.

³ In Jackson's method, the cutting points are used to determine minimum number of errors. Once the minimum number of errors has been determined the exact locations of the cutting points no longer enter into the computation of reproducibility. Consequently, for Jackson's method it doesn't matter which of the two cutting points is used for Item 5, since both result in one error. However, Guttman’s original procedure made use of the specific cutting point used. Guttman assigned the cutting points to the row marginals (the total scores) and then rescored every S on the basis of the cutting points. All Ss below the lowest cutting point would be scored as having failed all the items. In Table 3, for example, Item 3 has the lowest cutting point; subject A is below this cutting point and so he would be rescored as having failed all the items. All Ss between the first and second cutting points would be rescored as having passed one item. And so forth. The Guttman reproducibility index indicates the percentage of actual reproducibility as compared with the maximum reproducibility obtained by these rescoring processes. Therefore, if two items are given the same cutting point, the number of different classes or “cutting point scores” will be decreased; that is, the number of discriminations made by the scale is diminished.

After the cutting points have been assigned, the errors in each column are counted. From these it is possible to compute the reproducibility for each item ($R_i$) by dividing the number of errors (E) by the number of subjects (N) and subtracting the quotient from 1.

\[ R_i = 1 - \frac{E_i}{N} \tag{1} \]

The reproducibility for the entire test ($R_t$) may be computed by summing the errors for all items, dividing this by the number of subjects (N) times the number of items (k), and subtracting the quotient from 1.

\[ R_t = 1 - \frac{\sum_{i=1}^{k} E_i}{Nk} \tag{2} \]

For this example, the reproducibility of the test is 88.3 per cent, somewhat below the 90 per cent figure which Guttman uses as one criterion of scalability.

The Guttman index of reproducibility meets our first criterion in that it has an absolute maximum of 100 per cent for any test with more than one item, and our fourth criterion in that one can compute the index for each item as well as for the test as a whole. However, it suffers a serious shortcoming in having no unique minimal value. As Jackson (10) and others (1, 2, 13, 14) have pointed out, the index of reproducibility is drastically affected by the difficulty levels of the items in a test. The reason for this is that the difficulty of an item (percentage of persons passing) places a limit on the likelihood of an error: passing a difficult item, and failing an easy one. The reproducibility figure can approach its absolute lower limit of 50 per cent only when all the items have a difficulty level of 50 per cent, a trivial case in which 100 per cent reproducibility could be obtained only if one-half the subjects passed all the items while the other half failed all the items. With even slight departures from this strict condition, the lower limit of the reproducibility index rises sharply. In our illustrative example minimum reproducibility is 70 per cent. This fact makes it exceedingly difficult to evaluate an obtained index of reproducibility. With short scales and wide spread in item difficulties, Guttman’s figure of 90 per cent may on occasion be very little higher than the minimum reproducibility of the scale.

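Jackson's cutting-point procedure and the reproducibility quotients can be sketched as follows. This is a minimal illustration under stated assumptions: the 5 x 3 matrix is hypothetical, rows are assumed to be already ordered from highest to lowest total score, and the function names are not taken from Jackson.

```python
def min_errors(column):
    """Minimum error count for one item column.

    column holds 0/1 entries with subjects ordered from highest to lowest
    total score. A cutting line drawn after position c counts as errors the
    zeros above the line and the ones below it; every possible c is tried.
    """
    best = len(column)
    for c in range(len(column) + 1):
        errors = column[:c].count(0) + column[c:].count(1)
        best = min(best, errors)
    return best

def reproducibility(ordered_matrix):
    """Return (list of item quotients R_i, test quotient R_t)."""
    n = len(ordered_matrix)
    columns = [list(col) for col in zip(*ordered_matrix)]
    errors = [min_errors(col) for col in columns]
    r_items = [1 - e / n for e in errors]
    r_total = 1 - sum(errors) / (n * len(columns))
    return r_items, r_total

# Hypothetical ordered responses; the third item has one unavoidable error.
ordered = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 0],
]
r_items, r_total = reproducibility(ordered)

# Arithmetic check against the figures reported for the illustrative test:
# 7 errors among 10 subjects and 6 items.
r_reported = 1 - 7 / (10 * 6)
```

Because only the minimum error count enters the quotient, a column with two equally good cutting points (as with Item 5 in the text) gives the same result whichever one is chosen.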

JACKSON’S PLUS PERCENTAGE RATIO (PPR)

In order to circumvent this draw- 
back of Guttman's reproducibility 
index, Jackson (10) has developed 
another statistic which he calls the 
Plus Percentage Ratio (PPR). Unlike 
the Guttman index, PPR has the 
same absolute minimum for all tests. 
Referring again to Table 3, note the minimum reproducibility figures in the row labelled $MR_i$. Here the minimum reproducibility figure for each item ($MR_i$) is obtained by dividing the number of subjects who got a given item right (# right), or wrong (# wrong), whichever figure is the larger, by the number of subjects (N).

\[ MR_i = \frac{\#\ \text{rights or}\ \#\ \text{wrongs (whichever is larger)}}{N} \tag{3} \]
The minimum reproducibility for the entire test ($MR_t$) is computed by taking for each item the number of rights or the number of wrongs, whichever number is larger, summing the numbers so obtained over all items and dividing this sum by the product of the number of items (k) and the number of subjects (N).

\[ MR_t = \frac{\sum_{i=1}^{k} \left[\#\ \text{rights or}\ \#\ \text{wrongs (whichever is larger)}\right]_i}{kN} \tag{4} \]

In the next to the last row of Table 3, the “Plus %” ($PP_i$) figures are listed. Here the differences between the obtained reproducibility and the minimum reproducibility ($R_i - MR_i$) for each item are entered. These figures indicate how much better obtained reproducibility is than the minimum for that item. In the last row, the “Plus % Ratios” ($PPR_i$) are entered for each item. These figures may be obtained by dividing the Plus % figure for a given item by one minus the minimum reproducibility ($MR_i$) for that item.

\[ PPR_i = \frac{R_i - MR_i}{1 - MR_i} \tag{5} \]

The Plus Percentage Ratio for the total test ($PPR_t$) is similarly computed by dividing the difference between $R_t$ and $MR_t$ by one minus $MR_t$.

\[ PPR_t = \frac{R_t - MR_t}{1 - MR_t} \tag{6} \]

The Plus % Ratio has a distinct advantage over the index of reproducibility in that it has both an absolute maximum of one and an absolute minimum of zero for any test of more than one item. For the test illustrated here the $PPR_t$ is .61. As Jackson points out, testers should be prepared for the fact that this index will almost inevitably be lower than the Guttman index of reproducibility, often considerably lower. The index has not often been used on well-known tests, so it is difficult to say what an acceptable level should be. Jackson tentatively suggests 70 per cent. It remains to be seen whether this figure is a reasonable one in mental testing or attitude scaling. The PPR in any event has much to recommend it since it circumvents one of the most serious criticisms which has been leveled at Guttman's reproducibility index.

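The minimum reproducibility and Plus Percentage Ratio can be checked numerically. In the sketch below the pass counts 8, 7, 6, 4, 3, 2 for ten subjects are an assumption inferred from the item-level figures that survive in Tables 3 and 4, not a transcription of Table 1; the error total of 7 is the figure reported under Table 3. Function names are illustrative.

```python
def minimum_reproducibility(pass_counts, n_subjects):
    """MR_i for each item and MR_t for the whole test."""
    # For each item take rights or wrongs, whichever count is larger.
    larger = [max(p, n_subjects - p) for p in pass_counts]
    mr_items = [m / n_subjects for m in larger]
    mr_total = sum(larger) / (len(pass_counts) * n_subjects)
    return mr_items, mr_total

def plus_percentage_ratio(r, mr):
    """PPR = (R - MR) / (1 - MR); undefined when MR equals 1."""
    return (r - mr) / (1 - mr)

pass_counts = [8, 7, 6, 4, 3, 2]  # assumed item pass counts, easiest to hardest
n_subjects = 10
total_errors = 7                  # reported total of Jackson errors

r_total = 1 - total_errors / (n_subjects * len(pass_counts))
mr_items, mr_total = minimum_reproducibility(pass_counts, n_subjects)
ppr_total = plus_percentage_ratio(r_total, mr_total)
```

Under these assumed counts the sketch returns a minimum reproducibility of .70 and a ratio of .61, matching the values quoted in the text.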

LOEVINGER’S INDEX OF HOMOGENEITY (H)
Homogeneity of a Test ($H_t$)


A rather different approach to the 
measurement of the reproducibility 
of mental tests has been put forth by 
Loevinger, who uses the following 
as a definition of homogeneity (13, 
p. 29). 


The definitions of perfectly homogeneous 
and perfectly heterogeneous tests can be re- 
stated in terms of probability. In a perfectly 
homogeneous test, when the items are ar- 
ranged in the order of increasing difficulty, if 
any item is known to be passed, the probability 
is unity of passing all previous items. In a
perfectly heterogeneous test, the probability 
of an individual passing a given item A is the 
same whether or not he is known already to 
have passed another item B. 


It can be seen that this definition 
comes quite close to the Guttman 
notion of reproducibility, and in fact 


the perfectly reproducible and the 
perfectly homogeneous test are iden- 
tical. 

With the test items arranged in order of increasing difficulty, Loevinger computes the quantity S by finding, for all pairs of items, the proportion of subjects who have passed both items ($P_{ij}$). From this is subtracted the theoretical proportion who would have passed both items had they been independent ($P_i P_j$). These differences are summed over the k(k-1)/2 pairs of items (i.e., each item is paired with every other item in the test),

\[ S = \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} (P_{ij} - P_i P_j). \tag{7} \]
For a test made up of completely independent items, S would have a value of zero. S does not have an upper limit of unity when the test is perfectly homogeneous. The upper limit is fixed by the proportion of subjects passing the more difficult item in each pair ($P_j$).

\[ S_{max} = \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} (P_j - P_i P_j) \tag{8} \]


The homogeneity of a test ($H_t$) is then given by the ratio of these two quantities

\[ H_t = \frac{S}{S_{max}}. \tag{9} \]


This procedure is exactly analogous to that used by Jackson in computing the Plus Percentage Ratio. This can be seen more easily if Loevinger's equation is rewritten as follows, a bar over a subscript denoting failure of that item:

\[ H_t = \frac{\sum_{i=1}^{k-1} \sum_{j=i+1}^{k} \left[ (1 - P_{\bar{\imath}j}) - (1 - P_{\bar{\imath}} P_j) \right]}{\sum_{i=1}^{k-1} \sum_{j=i+1}^{k} \left[ 1 - (1 - P_{\bar{\imath}} P_j) \right]}. \tag{10} \]

The first term in the parentheses of the numerator ($1 - P_{\bar{\imath}j}$) indicates the proportion of subjects passing a harder item and failing an easier one subtracted from unity. This is very like the reproducibility coefficient which is given by the proportion of errors subtracted from unity. The second term in the numerator ($1 - P_{\bar{\imath}} P_j$) is the product of the proportion of subjects passing the harder item and the proportion failing the easy item, this product then subtracted from unity. The quantity ($1 - P_{\bar{\imath}} P_j$) is analogous to Jackson's minimum reproducibility. The denominator is seen to be the difference between unity (perfect reproducibility) and minimum reproducibility. The two methods differ only in the procedure for counting errors. Loevinger’s technique involves the equivalent of an examination of all pairs of items i≠j and counting every occasion upon which the harder item is passed and the easier item failed. In the illustrative example, such a tabulation yields a total of 13 errors, whereas Jackson's error count is 7. This is the reason that Loevinger's $H_t$ will usually be lower than Jackson's $PPR_t$. The former is .23, and the latter .61. In Jackson's system for counting errors, a deviant response is counted only once no matter where it occurs in the response pattern. For example, if items are arranged in order of decreasing difficulty, a response pattern of (1, 0, 0, 0) would be credited with one error, while in Loevinger's system, since the passed item was the hardest of the four, there would be three errors. The two methods also have somewhat different ways of computing minimum reproducibility, Jackson’s yielding a figure of .70, and Loevinger’s .72. Loevinger points out that her formula for $H_t$ is equivalent to

\[ H_t = \frac{\sigma_x^2 - \sigma_{het}^2}{\sigma_{hom}^2 - \sigma_{het}^2} \tag{11} \]


where all the variances refer to total raw scores. The first term in the numerator ($\sigma_x^2$) is the variance of the obtained scores, the second numerator term ($\sigma_{het}^2$) is the variance of the total scores which would be obtained from items of the same difficulties which were completely independent, and the first term in the denominator ($\sigma_{hom}^2$) is the variance in total scores which would be obtained if the same items were perfectly correlated. The raw score variance of a test made up entirely of independent items is the familiar

\[ \sigma_{het}^2 = \sum_{i=1}^{k} P_i Q_i, \qquad Q_i = 1 - P_i, \]

or the sum of the item variances. The raw score variance of a test made up wholly of perfectly correlated items is given by

\[ \sigma_{hom}^2 = \sum_{i=1}^{k} P_i Q_i + 2 \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} (P_j - P_i P_j). \tag{12} \]

The first term on the right of this equation is the item variance employed above, and the second term is two times a sum which is seen to be identical to $S_{max}$.

This relationship is interesting 
since it shows that total score vari- 
ance increases with reproducibility, 
being at a minimum when the item 
covariances are zero, and reaching an 
upper limit when item covariances 
are maximal. 
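The equivalence between the pairwise form of Loevinger's index and the variance form can be verified numerically. The following is a sketch under assumptions: the response matrix is hypothetical, the function names are illustrative, variances are population variances (divisor N), and `min(p[i], p[j])` stands in for the proportion passing the harder item of each pair so that the columns need not be pre-sorted.

```python
from statistics import pvariance

def item_proportions(matrix):
    """Proportion of subjects passing each item."""
    n = len(matrix)
    return [sum(row[j] for row in matrix) / n for j in range(len(matrix[0]))]

def loevinger_ht(matrix):
    """H_t = S / S_max computed from pairwise proportions."""
    n, k = len(matrix), len(matrix[0])
    p = item_proportions(matrix)
    s = s_max = 0.0
    for i in range(k - 1):
        for j in range(i + 1, k):
            p_both = sum(row[i] * row[j] for row in matrix) / n
            s += p_both - p[i] * p[j]
            s_max += min(p[i], p[j]) - p[i] * p[j]
    return s / s_max

def loevinger_ht_from_variances(matrix):
    """The same index computed from total-score variances."""
    k = len(matrix[0])
    p = item_proportions(matrix)
    var_x = pvariance([sum(row) for row in matrix])
    var_het = sum(pi * (1 - pi) for pi in p)
    var_hom = var_het + 2 * sum(
        min(p[i], p[j]) - p[i] * p[j]
        for i in range(k - 1) for j in range(i + 1, k)
    )
    return (var_x - var_het) / (var_hom - var_het)

# Hypothetical responses of five subjects to three items.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 0],
]
h_pairwise = loevinger_ht(responses)
h_variance = loevinger_ht_from_variances(responses)
```

Both routes give the same value because the obtained total-score variance equals the sum of the item variances plus twice the sum of the interitem covariances. Either function fails with a division by zero when every item is passed by all or by none of the subjects.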

Both Loevinger’s $H_t$ and Jackson's PPR have the advantage of being uninfluenced by the distribution of item difficulties, which makes them preferable to the Guttman reproducibility index when it is given without further information. The procedures are objective and can be reduced to routine computations. When “errors” occur mainly on item pairs which are close together in difficulty level, the two procedures should yield practically identical indices, but if there are “errors” which occur in item pairs which are widely different in difficulty level, Loevinger’s $H_t$ will be lower than Jackson’s PPR. Loevinger’s technique has the aesthetic advantage of making full use of the information contained in the response matrix, but the practical drawback of being tedious to compute when the number of items is large since k(k-1)/2 cross breaks have to be made to compute the $P_{ij}$s. However, Jackson's method is also laborious since it requires an initial posting of the entire response matrix.

The sampling distribution of $H_t$ is unknown, and Loevinger advises that it should not be used as an estimate of homogeneity unless the sample of subjects exceeds 100.

Homogeneity of an Item with a Test ($H_{it}$)

Loevinger's $H_t$ yields an index for the test as a whole, but does not provide an index of the homogeneity of each item with the test. For this purpose, she suggests another index, ($H_{it}$), the logic of which is the same as that employed in $H_t$. In a perfectly homogeneous test, subjects passing a given item should have higher total scores than those failing the item. The starting point is a formula developed by Long (15).

\[ \text{Long's Index} = 1 - \frac{2 \sum \text{“passes” below “fails”}}{PQ} \tag{13} \]

In 13, the numerator is two times the number of subjects passing a given item who have total scores lower than those of subjects who failed the same item, and the denominator is the product of the number of passes on the item (P) and the number of fails (Q). Loevinger points out that difficulties arise when two subjects have identical total scores, one of whom has failed the item and the other has passed it. There is also the question of whether the response to the item should in this computation be included in the total score. In order to circumvent these difficulties with Long's index, Loevinger proposes the modification

\[ H_{it} = 1 - \frac{2 \sum \text{“passes” below or tied with “fails”}}{PQ - \sum \text{“passes” one above “fails”}} \tag{14} \]


It is clear that this index can take 
values from minus to plus unity, but 
it is not clear that a zero value is ob- 
tained when there is no relation be- 
tween an item and the total test. The 
sampling properties of the index are 
unknown and will have to be in- 
vestigated to establish the value to 
be expected for a chance relation. 
The obtained $H_{it}$ values for the illustrative
test may be seen in the last
column of Table 6. 
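Long's index (equation 13) can be sketched as a count over pairs of subjects. This is one reading of the verbal description, stated here as an assumption: each (passer, failer) pair in which the passer has the lower total score counts once, tied pairs are not counted, and the item itself is included in the total score. The matrix and the function name are hypothetical, and Loevinger's modification is not implemented.

```python
def longs_index(matrix, item):
    """Long's index for one item: 1 - 2 * (passes below fails) / (P * Q)."""
    totals = [sum(row) for row in matrix]
    pass_scores = [t for row, t in zip(matrix, totals) if row[item] == 1]
    fail_scores = [t for row, t in zip(matrix, totals) if row[item] == 0]
    # Pairs in which a subject who passed scores lower than one who failed.
    below = sum(1 for ps in pass_scores for fs in fail_scores if ps < fs)
    return 1 - 2 * below / (len(pass_scores) * len(fail_scores))

# Hypothetical responses: the last item is passed only by a low scorer.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
index_first_item = longs_index(responses, 0)
index_last_item = longs_index(responses, 3)
```

The first item, passed only by the two highest scorers, reaches the maximum of plus unity, while the last item falls below zero. The index is undefined for an item that every subject passes or every subject fails.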


GREEN’S SUMMARY STATISTICS METHOD (I)


Green (4, 5) has recently developed a method for computing an index of consistency for a test (I) which has all the advantages of Jackson's PPR and Loevinger’s $H_t$, plus greater ease of computation. Like Jackson's PPR, I is given by

\[ I = \frac{Rep - Rep_{ind}}{1.00 - Rep_{ind}}, \tag{15} \]

where Rep is the obtained reproducibility of the test, $Rep_{ind}$ is the reproducibility which would be obtained with the same set of item difficulties and complete independence between items, and 1.00 is perfect reproducibility.

Green's method of computing errors is the same as that employed in 10 above, except that the summation is not over all pairs of items, i≠j, but only over those item pairs whose members are adjacent in difficulty level. Green's reproducibility is given by

\[ Rep = 1 - \frac{1}{Nk} \sum_{i=1}^{k-1} n_{i+1,\bar{\imath}} - \frac{1}{Nk} \sum_{i=2}^{k-2} n_{i+2,\,i+1,\,\bar{\imath},\,\overline{i-1}} \tag{16} \]

where N is the number of subjects, k the number of items. Items are ranked in order of difficulty, the most difficult item receiving rank k, and the easiest item rank 1. The quantity $n_{i+1,\bar{\imath}}$ is the number of subjects who both fail the ith item and pass the next most difficult item (i+1). There will be k-1 such item pairs. The last quantity, $n_{i+2,\,i+1,\,\bar{\imath},\,\overline{i-1}}$, is the number of subjects who have failed both item i-1 and i and passed both item i+1 and i+2. There will be k-3 such terms in this summation.

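The reproducibility that would be expected if the items had their observed difficulties, but were mutually independent is given by

\[ Rep_{ind} = 1 - \frac{1}{N^2 k} \sum_{i=1}^{k-1} n_{i+1}\, n_{\bar{\imath}} - \frac{1}{N^4 k} \sum_{i=2}^{k-2} n_{i+2}\, n_{i+1}\, n_{\bar{\imath}}\, n_{\overline{i-1}}, \tag{17} \]

where $n_{i+1}$ is the number of subjects passing item i+1 and $n_{\bar{\imath}}$ the number failing item i.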


These values for Rep and $Rep_{ind}$ are then put in 15 to obtain I, which will be unity for a perfectly reproducible test and zero for a test whose items are completely independent. Green suggests that I should be .50 for a test before its items can be considered scalable. Since this method makes only a partial count of the “errors” in a response matrix, it produces a slight overestimate of reproducibility. In one empirical investigation (5) it was found that the average discrepancy between Green's reproducibility and the exact reproducibility of ten scales was .002. Following a suggestion of Guttman (7), Green furnishes an approximation to the standard error of Rep.

\[ \sigma_{Rep} = \sqrt{\frac{(1 - Rep)(Rep)}{Nk}} \tag{18} \]

With this standard error it is possible to ascertain whether an obtained Rep is significantly larger than $Rep_{ind}$. Green warns, however, that when such a test yields borderline significance, one should be cautious in interpretation since both Rep and $\sigma_{Rep}$ are approximations. A high significance level does not necessarily indicate that the items are homogeneous, merely that the item intercorrelations are significantly greater than zero.

For the illustrative test, the computation of Rep, $Rep_{ind}$, and I are shown in Table 4.

The obtained Rep is .917, as compared with Jackson's .88, and .78 by Formula 10. The index of consistency (I) is seen to be .41, as compared with Jackson's PPR of .61, and Loevinger's $H_t$ of .23.

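Green's summary statistics can be sketched directly from equations 15 to 17. Several things here are assumptions: the pass counts 8, 7, 6, 4, 3, 2 for ten subjects are inferred from the products that survive in the Table 4 residue, the function names are illustrative, and columns are taken to run from easiest to hardest item. The value .917 for Rep is the figure reported in the text.

```python
def green_rep(matrix):
    """Observed Rep; columns must run from easiest to hardest item."""
    n, k = len(matrix), len(matrix[0])
    # Subjects failing item i while passing the next harder item.
    first = sum(
        1 for row in matrix for i in range(k - 1)
        if row[i] == 0 and row[i + 1] == 1
    )
    # Subjects failing items i-1 and i while passing items i+1 and i+2.
    second = sum(
        1 for row in matrix for i in range(1, k - 2)
        if row[i - 1] == 0 and row[i] == 0 and row[i + 1] == 1 and row[i + 2] == 1
    )
    return 1 - first / (n * k) - second / (n * k)

def rep_independent(pass_counts, n_subjects):
    """Rep expected under independence; counts run easiest to hardest."""
    k = len(pass_counts)
    n_fail = [n_subjects - p for p in pass_counts]
    # 0-based index i corresponds to rank i + 1 in the text.
    first = sum(pass_counts[i + 1] * n_fail[i] for i in range(k - 1))
    second = sum(
        pass_counts[i + 2] * pass_counts[i + 1] * n_fail[i] * n_fail[i - 1]
        for i in range(1, k - 2)
    )
    return 1 - first / (n_subjects**2 * k) - second / (n_subjects**4 * k)

def index_of_consistency(rep, rep_ind):
    """Green's I = (Rep - Rep_ind) / (1.00 - Rep_ind)."""
    return (rep - rep_ind) / (1.0 - rep_ind)

pass_counts = [8, 7, 6, 4, 3, 2]  # assumed item pass counts for N = 10
rep_ind = rep_independent(pass_counts, 10)
consistency = index_of_consistency(0.917, rep_ind)
```

With these assumed counts the independence value comes out at about .86 and the index at about .41, in line with the reported figures. Only adjacent items are compared, which is what makes the method quick and also why it undercounts errors slightly.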

TABLE 4
[Table caption not recoverable. The surviving entries show the computation of $Rep_{ind}$ and I for the illustrative test.]

\[ Rep_{ind} = 1 - \frac{1}{10^2 \cdot 6}(7 \cdot 2 + 6 \cdot 3 + 4 \cdot 4 + 3 \cdot 6 + 2 \cdot 7) - \frac{1}{10^4 \cdot 6}(4 \cdot 6 \cdot 3 \cdot 2 + 3 \cdot 4 \cdot 4 \cdot 3 + 2 \cdot 3 \cdot 6 \cdot 4) = .860 \]

\[ I = \frac{Rep - Rep_{ind}}{1.00 - Rep_{ind}} = \frac{.917 - .860}{1.000 - .860} = .407 \]

THE PHI COEFFICIENT ($\phi_{it}$)⁴

A measure of item reproducibility can be derived from the phi coefficient. This measure has the advantages of an absolute maximum of 1.00, an absolute minimum of 0.00, a known sampling distribution, and direct relationship to conventional test construction procedure.

The logic behind the procedure is simple. Take as an example an item which 30 per cent of the subjects pass and 70 per cent fail. If the item is perfectly reproducible in a perfectly reproducible test, the 30 per cent of the subjects with the highest total scores should all pass the item; the 70 per cent with the lowest total scores should all fail the item. Subjects can easily be ranked on total score and this distribution cut in the same ratio as the pass-fail ratio on any particular item being evaluated. It is then simple to determine the number of persons high on total score who pass the item, the number of high persons who fail the item, the number of low persons who fail the item, and the number of low persons who pass the item. The data may be put in a fourfold table as in Table 5.

⁴ The writers find that Cronbach (1, p. 324) has anticipated them in this suggested manner of estimating reproducibility.


TABLE 5
ITEM-TOTAL SCORE PHI COEFFICIENT ($\phi_{it}$)

                          Total Score*
                          Low     High    Total
Item    Pass Item i       A       B       A+B
Score   Fail Item i       C       D       C+D
        Total             A+C     B+D     N

* Total score distribution is broken so that number of subjects in low group is equal to number failing item i: (C+D = A+C).


Obviously, one has only to deter- 
mine the marginal sums (which are 
determined by the pass-fail ratio of 
the item) and one of the cell frequen- 
cies, since the rest can be computed 
by subtraction from the marginals. 

Splitting subjects on the basis of 
total score in the same ratio as the 
pass-fail split on an item may pro- 
duce a problem if several subjects are 
tied for total score across the cutting 
points. The tied subjects should be 
randomly distributed between the 
high and low groups so that the total 
scores are split in the same ratio as 
the pass-fail ratio. Take as a simple 
example the case in which 100 sub- 
jects have answered a questionnaire 
in such a manner that the pass-fail 
ratio on a particular item is 30/70. 
To evaluate this item, the subjects 
must be split on total score so that 
the highest 30 per cent constitute one 
group and the lowest 70 per cent 
constitute the second group. If three 
persons are tied for rank 30 in total 
score, two will be arbitrarily con- 
sidered ranks 29 and 30 respectively, 
and will be placed in the high group. 
The third person will be assigned 
rank 31, and, despite the fact that 
his score is the same as that of two 
subjects in the high group, he will be 
placed in the low group. If the total 
number of subjects is reasonably 
large, and if the number of subjects 
having the critical tied score is not a 
large percentage of the total number 
of subjects, this will not distort the 
resulting phi. 

Since the marginals for the total 
score have been determined in a man- 
ner that forces them to be equal to 
the marginals for the particular item, 
the usual phi formula can be simpli- 
fied to 


$$\phi_{it} = \frac{BC - AD}{(A+B)(C+D)} \qquad [19]$$


where the quantities A, B, C, and D 
correspond to cell entries in Table 5 
above. 
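
The simplification in Formula 19 can be checked directly. What follows is a short worked derivation, assuming the usual product-moment form of phi for a fourfold table written in the cell labels of Table 5; that general form is supplied here for illustration and is not spelled out in the text.

```latex
\begin{align*}
\phi &= \frac{BC - AD}{\sqrt{(A+B)(C+D)(A+C)(B+D)}} \\
\intertext{The cut on total score forces $A + C = C + D$, so that $D = A$ and
           therefore $B + D = A + B$:}
\phi_{it} &= \frac{BC - AD}{\sqrt{(A+B)^{2}(C+D)^{2}}}
           = \frac{BC - AD}{(A+B)(C+D)}
\end{align*}
```

Because D = A under this cut, the numerator may also be written BC − A², which is why a single cell frequency together with the marginals is enough to compute the coefficient.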

The null hypothesis for such a phi 
coefficient is, in every case, that the 
obtained phi is not significantly 
greater than zero. This can be tested 
by a chi square or a Fisher exact test 
on the fourfold table. 
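
The procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' own computing routine: the function names, the fixed random seed, and the use of a random secondary sort key to split ties at the cutting point are all assumptions made for the example. The chi-square helper relies on the standard identity for a fourfold table, chi square (1 df, uncorrected) = N times phi squared.

```python
import random


def item_total_phi(responses, item, rng=None):
    """Item-total phi with total scores cut in the item's pass-fail ratio.

    responses: list of rows (one per subject) of 0/1 item scores.
    Returns (phi, (A, B, C, D)) using the cell labels of Table 5.
    """
    rng = rng or random.Random(0)  # fixed seed is an illustrative choice
    n = len(responses)
    totals = [sum(row) for row in responses]
    n_pass = sum(row[item] for row in responses)
    n_fail = n - n_pass
    if n_pass == 0 or n_fail == 0:
        raise ValueError("phi is undefined when all subjects pass or all fail")

    # Rank by total score; the random secondary key splits ties at the cut.
    keyed = [(totals[s], rng.random(), s) for s in range(n)]
    order = [s for _, _, s in sorted(keyed)]
    low = set(order[:n_fail])  # low group is as large as the number failing

    A = sum(1 for s in low if responses[s][item] == 1)  # pass item, low total
    B = n_pass - A                                      # pass item, high total
    C = n_fail - A                                      # fail item, low total
    D = n_pass - B                                      # fail item, high total
    phi = (B * C - A * D) / ((A + B) * (C + D))         # Formula 19
    return phi, (A, B, C, D)


def chi_square_from_phi(phi, n):
    """Uncorrected chi square (1 df) for a fourfold table: N * phi**2."""
    return n * phi ** 2


# Hypothetical data: four subjects, three items forming a perfect scale.
scale = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
for i in range(3):
    print(i, item_total_phi(scale, i))
```

With small samples the exact test mentioned in the text would be preferable to the chi-square approximation; it is omitted here to keep the sketch short.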

If the investigator desires to 
“purify” his test, he must choose a 
cutting point and select all the items 
with phi coefficients above this cut- 
ting point to constitute his repro- 
ducible scale. New total scores can 
then be computed on the basis of the 
selected items, and phi coefficients re- 
calculated to give an estimate of the 
reproducibility of the new scale. The 
coefficients for some of the items not 
included in the new total score may 
be so high that these items can be in- 
cluded in the scale, while those for 
some of the included items may drop 
to a level which makes it advisable to 
exclude them. 
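
The "purifying" cycle just described lends itself to a short loop. The sketch below is one possible reading of it, under stated assumptions: the names `phi_with_total` and `purify` are hypothetical, ties in total score are left in input order rather than split at random, an item that everyone passes or everyone fails is given a phi of zero, and the loop stops when the selected set no longer changes.

```python
def phi_with_total(rows, item, scale_items):
    """Phi between `item` and the total score over `scale_items`.

    Totals are cut in the item's pass-fail ratio; with a = passes falling in
    the low group, Formula 19 reduces to 1 - a*N / (n_pass * n_fail).
    """
    n = len(rows)
    totals = [sum(r[j] for j in scale_items) for r in rows]
    n_pass = sum(r[item] for r in rows)
    n_fail = n - n_pass
    if n_pass == 0 or n_fail == 0:
        return 0.0  # assumption: an item without variance is treated as unrelated
    order = sorted(range(n), key=lambda s: totals[s])  # stable: ties keep input order
    a = sum(rows[s][item] for s in order[:n_fail])
    return 1.0 - a * n / (n_pass * n_fail)


def purify(rows, cutoff, max_rounds=20):
    """Repeatedly rescore on the items whose phi exceeds `cutoff`."""
    k = len(rows[0])
    selected = list(range(k))
    phis = [phi_with_total(rows, i, selected) for i in range(k)]
    for _ in range(max_rounds):
        keep = [i for i in range(k) if phis[i] > cutoff]
        if keep == selected or not keep:
            break  # stable scale reached, or nothing survives the cutoff
        selected = keep
        # Every item, kept or not, is re-correlated with the new total score.
        phis = [phi_with_total(rows, i, selected) for i in range(k)]
    return selected, phis


# Hypothetical data: items 0-2 scale perfectly, item 3 runs against them.
rows = [
    [0, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
]
print(purify(rows, cutoff=0.5))
```

Because excluded items are re-evaluated against each new total score, an item dropped in one round can re-enter in a later one, as the text anticipates.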

Unlike some indices of reproduci- 
bility, this index is not affected by 
extremes of item difficulty. This is 
true because phi is not an index of the 
frequency in one cell, but is deter- 
mined by the intercorrelation be- 
tween cells. The method has a dis- 
advantage, a purely aesthetic one, 
but one that may prejudice some 
workers against it; the phi coefficients 
so obtained are not likely to yield 
many values in the .80’s or .90’s. The 
phi coefficients computed for the 
items in the illustrative test are 
shown in Table 6, where they may be 
compared with those computed by 
Jackson's PPR and Loevinger's H_it. 


TABLE 6 

Item-Total Score Phi Coefficients 
for Illustrative Six-Item Test 

Hi 

1.000 
.714 
.333 
.619 

.867 
.889 

Though this method of computing 
a phi coefficient between a test item 
and the total score has the advan- 
tages of a known sampling distribu- 
tion, absolute maximum and mini- 
mum values, and freedom from re- 
strictive distribution assumptions, it 
does not furnish an index for the test 
as a whole. It is possible however to 
derive one by an averaging of the ob- 
tained phi coefficients. Such an ap- 
proach is shown in Formula 20, which 
Cronbach says is analogous to Gutt- 
man’s formula for reproducibility. 

$$R = \frac{1}{k} \sum_{i=1}^{k} \left[ 1 - 2 p_i q_i (1 - \phi_i) \right] \qquad [20]$$


Cronbach explains (1, p. 324): 


The correlation of any two-choice item 
with a total score on a test may be expressed 
as a phi coefficient, and this is common in 
conventional item analysis. Guttman di- 
chotomizes the test scores at a cutting point 
selected by inspection of the data. We will 
get similar results if we dichotomize scores at 
that point which cuts off the same proportion 
of cases as pass the item under study. [Our 
φ_it will be less in some cases than it would be 
if determined by Guttman's inspection pro- 
cedure.] Simple substitution in Guttman's 
definition ... leads to [Formula 20 above] 
where the approximation is introduced by the 
difference in ways of dichotomizing. The 
actual R obtained by Guttman will be larger 
than that from [this formula]. 


In our example the value turns out 
to be .80 as compared with the repro- 
ducibility figure of .88. 
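
The form of Formula 20 can be motivated with a little algebra. The sketch below assumes the fourfold table of Table 5 expressed as proportions, with $p_i$ the proportion passing item $i$ and $q_i = 1 - p_i$. The symbol $a = A/N$ is introduced here only for illustration, and counting cells A and D as the "errors" is an interpretive assumption.

```latex
\begin{align*}
\phi_i &= \frac{(p_i - a)(q_i - a) - a^{2}}{p_i q_i}
        = 1 - \frac{a}{p_i q_i}
        && \text{since } p_i + q_i = 1, \\
a &= p_i q_i \,(1 - \phi_i), \\
\frac{A + D}{N} &= 2a = 2\, p_i q_i \,(1 - \phi_i).
\end{align*}
```

Averaging one minus this error proportion over the k items gives the expression in Formula 20.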

This composite index for the entire 
test will have a maximum value of 
1.00 and a chance value of 


$$1 - \frac{2}{k} \sum_{i=1}^{k} p_i q_i$$


which approaches .50 as the average 
item difficulty approaches 50 per cent. 
The sampling distribution of this 
statistic, to our knowledge, is not 
known. 
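
As a rough numerical companion to Formula 20, the following sketch computes the composite index from item difficulties and item-total phi coefficients, and obtains the chance value by setting every phi to zero. The function names and the input values are illustrative assumptions; they are not the figures of the six-item example.

```python
def composite_reproducibility(p, phi):
    """Formula 20: mean over items of 1 - 2*p*q*(1 - phi)."""
    k = len(p)
    return sum(1 - 2 * pi * (1 - pi) * (1 - fi) for pi, fi in zip(p, phi)) / k


def chance_reproducibility(p):
    """Value of Formula 20 when every item-total phi is zero."""
    return composite_reproducibility(p, [0.0] * len(p))


# Hypothetical difficulties and phi coefficients.
p = [0.2, 0.4, 0.5, 0.7]
phi = [0.6, 0.5, 0.7, 0.4]
print(composite_reproducibility(p, phi))
print(chance_reproducibility(p))
print(chance_reproducibility([0.5, 0.5, 0.5]))  # all items at 50 per cent
```

The last call illustrates the statement in the text: with every item at the 50-per-cent difficulty level the chance value is .50, and it rises as difficulties move toward the extremes.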


DISCUSSION 


This concludes the exposition of 
the major methods which have been 
put forward to give an index of the 
reproducibility of tests. Of those 
which yield indices for the test as a 
whole, several meet serious objections 
which have been leveled at Gutt- 
man's scalogram analysis. The tech- 
niques of Jackson, Loevinger, and 
Green are all objective, and result in 
measures which are not affected by 
the distribution of item difficulties. 
All have the same underlying ra- 
tionale, but differ slightly in the way 
in which “errors” are counted. 
Loevinger's H_t is the most conserva- 
tive of the three since all possible er- 
rors are counted; Jackson's PPR is 
the least conservative, and Green's I 
will usually fall between the two. 
The principal and not inconsiderable 
advantage of Green's technique is 
ease of computation, an important 
factor when the number of subjects 
and test items is large. Green's tech- 
nique is the only one discussed that 
gives an estimate of significance for 
the reproducibility of the entire test. 

Of the methods for computing the 
homogeneity of an item with the total 
test, the phi coefficient seems the 
most desirable because computation 
is easy and because the significance 
level of the obtained statistic can be 
determined exactly. Almost any of 
the commonly used item-analysis sta- 
tistics—point biserial, biserial, or 
Flanagan correlation  coefficient— 
may of course be interpreted as an 
index of item reproducibility, since in 
a reproducible test any person pass- 
ing a given item will pass more other 
items than a person failing that item. 
They differ from the phi coefficient 
mainly in the number of assumptions 
they impose upon the data. Those 
employing conventional item-analysis 
statistics have been quite willing to 
assume an interval scale and a dis- 
tribution function, usually normal, 
while those working within the frame- 
work of the concept of reproducibility 
have in general foresworn the unit of 
measurement and have thus con- 
fined themselves to distribution-free 
statistics. 

All the reproducibility indices rest 
upon the same assumption that in a 
reproducible or homogeneous test, 
one can reproduce the entire response 
pattern of passes and fails, given the 
total number of items correct, and the 
item difficulties. All the methods em- 
ploy the same data in the response 
matrix. All agree that in the response 
matrix of the perfectly reproducible 
test there will be no instances in 
which a subject passes an item more 
difficult than one he has failed. This 
is equivalent to saying either that all 
interitem covariances are maximal, 
or that the variance in total scores is 
maximal. Conversely, the test with 
lowest reproducibility will exhibit 
zero interitem covariances, and mini- 
mal variance in total scores. 
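
The two limiting cases in this paragraph can be illustrated with a small check. The sketch below is an assumption-laden illustration, not a procedure from the article: `is_perfect_scale` tests whether any subject passes an item harder than one he has failed, and the variance helpers compare the obtained total-score variance with the sum of p times q, which is the variance expected when all interitem covariances are zero. Population variances (dividing by N) are used throughout.

```python
def is_perfect_scale(rows):
    """True if no subject passes an item harder than one he has failed."""
    k = len(rows[0])
    pass_counts = [sum(r[j] for r in rows) for j in range(k)]
    easiest_first = sorted(range(k), key=lambda j: -pass_counts[j])
    for r in rows:
        pattern = [r[j] for j in easiest_first]
        t = sum(pattern)
        # In a perfect scale a total of t means exactly the t easiest items.
        if pattern != [1] * t + [0] * (k - t):
            return False
    return True


def variance(xs):
    """Population variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def variance_if_independent(rows):
    """Sum of p*q: total-score variance with zero interitem covariances."""
    n = len(rows)
    ps = [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]
    return sum(p * (1 - p) for p in ps)


# Hypothetical matrices with the same item difficulties (3, 2, and 1 passes).
perfect = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
jumbled = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [1, 0, 1]]
for name, m in (("perfect", perfect), ("jumbled", jumbled)):
    totals = [sum(r) for r in m]
    print(name, is_perfect_scale(m), variance(totals), variance_if_independent(m))
```

Both matrices share the same item marginals, so the difference in total-score variance reflects only how the passes are arranged across subjects.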


Reproducibility and Factor Analysis 


It is obvious that the phi coefficient 
method of determining the homoge- 
neity of an item with total test is very 
similar to the procedure in classical 
test construction for “purifying” a 
test. 

A common procedure for evaluat- 
ing an item in conventional test con- 
struction is to compare the number 
of subjects passing the item among 
the 27 per cent of the sample making 
the highest total scores as opposed to 
the 27 per cent making the lowest 
total scores. A “good” item is one 
that discriminates between these 
highs and lows. Consequently, the 
items which would be chosen as pro- 
ducing the most reproducible scale in 
the phi procedure for obtaining re- 
producibility would also be selected 
as the most discriminating in conven- 
tional test statistics. This point is 
important when considering the re- 
lationship between reproducibility 
and factor analysis. 

Several authors have been con- 
cerned with the question of the rela- 
tionship between reproducibility and 
factor analysis. Loevinger (14) has 
stated that factor analysis and re- 
producibility are unrelated. Hum- 
phreys (9) appears to agree with Loe- 
vinger on this point and attacks 
reproducibility for not being as satis- 
factory a tool for research as factor 
analysis. He feels that reproducibil- 
ity will lead to a confusing multiplic- 
ity of tests, while a factor analytic 
approach will not. Humphreys uses 
the hypothetical case of the problems 
involved in constructing a mechanical 
information test. The criterion of 
reproducibility, he fears, would re- 
quire the construction of separate 
tests for the cross saw, the brace and 
bit, the pipe wrench, etc. On the 
other hand, all these tests would 
probably appear on a single common 
factor that would be orthogonal to 
other factors. 


The writers disagree with both 
Loevinger and Humphreys, feeling 
that reproducibility and factor anal- 
ysis are closely related. This relation- 
ship can be made obvious by con- 
sideration of the Wherry-Gaylord 
iterative analysis (19). This is a 
method for discovering homogeneous 
groupings of items in a test. It in- 
volves correlating each item with the 
total score. Items with the highest 
correlations are selected and the test 
rescored on the basis of these items. 
All the items are then correlated with 
the new total scores. This procedure 
is continued until a stable group of 
items is extracted. These items con- 
stitute a single factor. The remaining 
items can be rescored and additional 
factors extracted. The first factor re- 
moved would be the general factor. 
As can be seen, the phi method of ob- 
taining reproducibility corresponds 
very closely to the Wherry-Gaylord 
extraction of the general factor. The 
principal differences are that the 
Wherry-Gaylord does not require 
that the finally selected items have a 
range of item difficulties, and does 
not cut total scores at the same ratio 
as the item-difficulty levels. Evidence 
reported by Wherry, Campbell, and 
Perloff (20) suggests that the Wherry- 
Gaylord general factor will corre- 
spond to the general factor obtain- 
able in a Thurstone multiple factor 
analysis. The present writers found 
similar evidence in an analysis of a 
morale scale. After the morale scale 
had been subjected to a Thurstone 
multiple factor analysis, it was ad- 
ministered to a new group of subjects 
and subjected to a phi reproducibil- 
ity scaling. The resulting scale was 
almost identical in item content with 
the Thurstone general factor. 

While it appears to be true that a 
highly reproducible scale will tend to 
measure a single factor (since the phi 
analysis will isolate the general fac- 
tor in the test items), not all single- 
factor tests will be highly reproduci- 
ble scales. This is because a repro- 
ducible scale must have a range of 
difficulty levels if all persons are not 
to be forced into two categories: 
either all items passed or all items 
failed. The following example points 
up the reason this is true. If all items 
were at the 50-per-cent difficulty 
level, and if the test were perfectly 
reproducible, the 50 per cent of the 
subjects with the highest total scores 
would score correct on all items; the 
50 per cent of the subjects with the 
lowest total scores would score in- 
correct on all items. This restriction 
is not necessary for all single-factor 
tests. Single-factor tests can have all 
items at the same difficulty level and 
still have a wide range of total scores 
due to the almost inevitable presence 
of error variance in the items. Re- 
producibility is impossible in such a 
case. Despite this lack of reproduci- 
bility, the single-factor test might be 
quite adequate since, if two persons 
score high on a single-factor test it is 
because they are high in the factor, 
and the differential patterns of their 
responses must be irrelevant for pre- 
diction since the differential patterns 
must be a result of error variance and 
do not represent stable patterns. If 
the differential patterns were differ- 
entially predictive, the test could not 
be a single-factor test. In those situa- 
tions, therefore, where it is desirable 
to have all items at the same diffi- 
culty level, reproducibility is usually 
not a useful approach. The excep- 
tion to this rule is the case in which a 
single discrimination is desired—e.g., 
pass vs. fail. In this case all items 
should have pass per cents which are 
proportional to the pass per cent de- 
sired for the whole test (16). 

In many practical test-construction 
situations, where the logic of the 
situation is not incompatible with re- 
producibility, it appears to the writers 
that obtaining a general-factor test 
through phi reproducibility is simpler 
than through a Thurstone multiple- 
factor analysis. In addition to the 
relative ease of computation, the set 
of items so obtained should form not 
only a single-factor test, but also a 
reproducible scale. 


Reproducibility and Reliability 


It is obvious that the techniques 
for computing so-called reliability co- 
efficients from a single test admin- 
istration employ exactly the same 
data which have been used to com- 
pute the indices of reproducibility 
described above. Cronbach (1) has 
already pointed out the intimate rela- 
tion of Guttman's reproducibility to 
the Kuder-Richardson Formula 20, 
which he has rechristened alpha. The 
key term in alpha is the ratio of two 
variances, Σpq/σ²_t. As Loevinger 
points out (13, p. 31), Σpq gives the 
raw score variance which would be 
obtained from a test whose items were 
completely independent (σ²_0); and 
σ²_t is the obtained raw score variance. 
Loevinger’s Formula 11 has these 
same quantities in it, plus a third 
representing the raw-score variance 
of a test whose items were perfectly 
correlated. It should be noted that 
the lower limit of alpha is always 
zero, but the upper limit is dependent 
upon the distribution of item diffi- 
culties. The obtained alpha for our 
illustrative test is .47, and the upper 
limit of alpha for this set of item 
difficulties is .88. 
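
The dependence of alpha's upper limit on item difficulties can be seen numerically. The sketch below assumes the familiar Kuder-Richardson Formula 20 form, alpha = k/(k − 1) times (1 − Σpq/σ²_t), and assumes that the upper limit is reached when the items are perfectly cumulative, so that an item with c passes is passed by the c highest-scoring subjects. Function names and data are hypothetical, and the data are not those of the illustrative test.

```python
def variance(xs):
    """Population variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def kr20_alpha(rows):
    """Kuder-Richardson Formula 20 for a 0/1 response matrix."""
    n, k = len(rows), len(rows[0])
    ps = [sum(r[j] for r in rows) / n for j in range(k)]
    sum_pq = sum(p * (1 - p) for p in ps)
    var_t = variance([sum(r) for r in rows])  # must be nonzero
    return k / (k - 1) * (1 - sum_pq / var_t)


def alpha_upper_limit(pass_counts, n):
    """Alpha for a perfectly cumulative test with the given pass counts."""
    k = len(pass_counts)
    sum_pq = sum((c / n) * (1 - c / n) for c in pass_counts)
    # Subject s (0 = highest scorer) passes every item with more than s passes.
    max_totals = [sum(1 for c in pass_counts if c > s) for s in range(n)]
    return k / (k - 1) * (1 - sum_pq / variance(max_totals))


perfect = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
print(kr20_alpha(perfect))                 # a perfect scale, spread difficulties
print(alpha_upper_limit([3, 2, 1], n=4))   # same difficulties, same limit
print(alpha_upper_limit([2, 2, 2], n=4))   # all items at 50 per cent
```

In this toy case a perfectly reproducible test with spread-out difficulties still falls short of an alpha of 1.00, while items of equal difficulty allow the limit to reach it.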

In order to make alpha independ- 
ent of the distribution of item diffi- 
culties, Horst (8) has developed a 
formula which turns out to be identi- 
cal with Loevinger's 11, except for a 
correction term composed of the ratio 
of the maximal to obtained score 
variance. Since this ratio has a lower 
limit of 1.00, figures obtained by 
Horst's method will necessarily be 
larger than Loevinger's except in the 
perfect case. The Horst formula for 
the reliability coefficient corrected for 
dispersion of item difficulties is given 
below. 


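$$r_{tt} = \frac{\sigma_t^2 - \sum pq}{\sigma_{max}^2 - \sum pq} \cdot \frac{\sigma_{max}^2}{\sigma_t^2} \qquad [21]$$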


The striking similarity of the 
Loevinger Formula 11 and the Horst 
Formula 21 causes one to suspect that 
the difference between single-trial 
reliability and homogeneity or repro- 
ducibility is more apparent than real. 

The critical difference between the 
“reproducibility” and the “reliabil- 
ity” camps of test construction is seen 
most clearly in the ways they in- 
terpret their indices. When a test 
shows perfect reproducibility, it will 
also show perfect reliability by any 
of the formulas described so far. In 
order for this unlikely event to occur, 
several conditions must be met: all 
the items must be homogeneous in 
content, all subjects must be similarly 
constituted in the trait, attitude, or 
ability being tapped; and this trait, 
attitude, or ability must remain 
stable during the testing period. Any 
departure from these conditions will 
cause any of these measures to fall, 
and there is no way to tell on the 
basis of the response matrix alone 
what is amiss. An astute dropping of 
rows or columns from the matrix 
(subjects and/or items) will, of 
course, make things look better. In 
any event, a low figure indicates that 
considerable information will be lost 
in attempting to order subjects on a 
single linear continuum on the basis 
of their total scores. It is here that 
techniques such as Lazarsfeld's latent 
structure analysis (12) may be used 
to determine the minimal number of 
dimensions (classes) needed to ac- 
count for the information contained 
in a response matrix. With this tech- 
nique, a subject, instead of being 
given a total score, is assigned a 
probability of belonging in each of 
several classes. No unidimensional- 
ity is assumed, so there is no question 
of item or subject elimination to force 
unidimensionality, a procedure rou- 
tinely employed by those addicted 
to Guttman scaling and classical test 
construction. When any such item- 
elimination procedure is used in test 
construction, a reliability or re- 
producibility figure computed on the 
final sample of items cannot be 
evaluated until the new version of 
the test has been administered to 
another sample of subjects. A low re- 
producibility figure is generally taken 
as an indication of item heterogeneity 
in a test, while a low reliability figure 
of the Kuder-Richardson variety is 
usually seen as an indication of the 
presence of considerable error vari- 
ance. In the absence of other in- 
formation, either interpretation is 
equally plausible, or suspect, since, as 
was pointed out above, the indices 
employ the same information from 
the response matrix. 


Items and Subjects 

There is no reason why the tech- 
niques of computing reproducibility 
or single-trial reliability cannot be re- 
versed to yield coefficients about the 
homogeneity of subjects, instead of 
test items. It is surprising that this 
has not been done more often, espe- 
cially in the area of attitude measure- 
ment. Lack of reproducibility in a 
response matrix is just as likely to be 
due to heterogeneity in the popula- 
tion tested, as to heterogeneity in the 
test items. For most of the indices 
described above, computation of sub- 
ject homogeneity would merely in- 
volve switching row and column 
marginals in the formulas. Such a 
technique would seem to be a promis- 
ing one for the identification of de- 
viants. 
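
Switching the roles of rows and columns amounts to transposing the response matrix before computing an index. The sketch below illustrates this with a variance-ratio index built from the three quantities the text attributes to Loevinger's Formula 11 (obtained, independent-item, and maximal total-score variance). The exact form used here, the function names, and the data are assumptions for illustration only.

```python
def variance(xs):
    """Population variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def homogeneity(matrix):
    """(obtained - independent) / (maximal - independent) total-score variance.

    Rows are the units being scored; columns are the units being summed over.
    Undefined (division by zero) if the maximal and independent variances coincide.
    """
    n, k = len(matrix), len(matrix[0])
    counts = [sum(row[j] for row in matrix) for j in range(k)]
    independent = sum((c / n) * (1 - c / n) for c in counts)
    obtained = variance([sum(row) for row in matrix])
    # Maximal variance: a column with c ones is assigned to the c top rows.
    maximal = variance([sum(1 for c in counts if c > s) for s in range(n)])
    return (obtained - independent) / (maximal - independent)


def transpose(matrix):
    """Swap subjects and items."""
    return [list(col) for col in zip(*matrix)]


# Hypothetical response matrix: rows are subjects, columns are items.
responses = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 1]]
print(homogeneity(responses))             # index over items
print(homogeneity(transpose(responses)))  # same index over subjects
```

The same function serves both purposes; only the orientation of the matrix changes, which is the sense in which the text speaks of switching row and column marginals.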


Why Reproducibility or Single-Trial 
Reliability? 

Having come this far, it is high 
time we asked why a test with high 
reproducibility or single-trial reliabil- 
ity is a good thing. Social scientists 
are all too prone to assume that it is, 
and to think no further about it. As 
Cronbach (1) has pointed out, re- 
producibility is in a sense a measure 
of the redundancy in a test. For 
many purposes, this is undesirable. 
Whenever test results are used to 
predict a dichotomous criterion such 
as hire-not hire, pass-fail, butcher- 
candlestickmaker, psychotic-nor- 
mal, in short to classify subjects, it 
can be argued that the last thing in 
the world a test should have is high 
internal consistency. The real need 
is a set of items highly related to the 
criterion but not to each other. This 
is, of course, a restatement of the 
multiple-correlation approach to pre- 
diction. Ideally each item would 
represent a different pure factor. In 
such a situation, interest lies not in 
ordering subjects on some linear hy- 
pothetical trait, attitude, or ability 
continuum, but in an efficient dichot- 
omization of the subjects or an or- 
dering on the basis of the probability 
of membership in a class. To the ex- 
tent that the test items are redun- 
dant, valuable testing time is wasted. 
It is a mistake to think such a test is 
“measuring” something, in the usual 
sense of that word. That a test can 
differentiate between neurotics and 
normals is no indication that “neurot- 
icism” is a trait on which people can 
be ordered in some simple fashion. 
Much confusion in clinical literature 
is based on this fallacy. Unless the 
instrument exhibits high homoge- 
neity, reproducibility, or single-trial re- 
liability, there is no reason to as- 
sume that the score on the test can 
yield an ordering of the subjects 
on some unidimensional continuum 
which can be given a label. 

It is the person doing “basic” re- 
search who is apt to be more in- 
terested in ordering subjects on a uni- 
dimensional continuum. For him, 
the question of the internal consist- 
ency of his multiple-item test or 
questionnaire is of immediate con- 
cern. He may start with the un- 
shakable conviction that the trait he 
has in mind is unidimensional, in 
which case he will engage in an often 
lengthy process of test construction, 
weeding out items until he achieves 
an instrument with internal con- 
sistency at a satisfactorily high level. 
This type of worker usually longs for 
an infinite population of items and 
subjects. When this longing is ful- 
filled, or even approximated, he can 
usually come up with a selection of 
items which, when administered to an 
appropriate population, will yield a 
response matrix of the desired in- 
ternal consistency. He may even re- 
gard this achievement as support for 
his initial assumption about the uni- 
dimensionality of the trait, though 
the logic of such a conclusion is some- 
what less than perfect, considering 
the amount of information thrown 
away in order to make things come 
out so neatly. 

On the other hand, he may begin 
with a more modest aim: to find out, 
for a given set of items, the minimum 
number of parameters needed to ac- 
count for the obtained responses of 
subjects to these items. If he finds 
that the response matrix shows high 
reproducibility or high single-trial 
reliability, he is apt to be pleased be- 
cause life is so simple; but if he does 
not find his data so neatly arranged, 
he is likely to resign himself to fairly 
laborious procedures in order to find 
out the dimensionality of the data he 
has collected rather than to attribute 
any departure from unidimensional- 
ity to error variance. 

The important point is that all the 
techniques mentioned here, whether 
they are regarded as indices of re- 
producibility, homogeneity, or single- 
trial reliability, are based upon the 
same raw data in the response ma- 
trix; and all are more or less inter- 
changeable with a little algebraic 
manipulation, though, as we have 
seen, they yield different numbers. 
How the number is interpreted de- 
pends not upon which one of these 
formulas is employed, since they are 
all basically equivalent, but upon 
what assumptions are made. One 
can assume that the items are homo- 
geneous and that the subjects are 
similarly constituted in the trait 
being measured, in which case one 
uses the index as a measure of intra- 
individual trait stability. On the 
other hand, one can assume trait 
stability and subject homogeneity, in 
which case the index is said to reflect 
the homogeneity of the items. As 
was mentioned above, one may 
equally well assume trait stability 
and item homogeneity and employ 
the index as a measure of the homo- 
geneity of the subjects. Any pair of 
assumptions appears to be about as 
plausible as any other. The impor- 
tant point is that from a single re- 
sponse matrix there is no way of tell- 
ing what assumptions are reasonable. 
An obtained index, be it Jackson's 
PPR, Loevinger's H_t, Green's I, 
Cronbach's alpha, or Horst's r_tt, will 
be less than 1.00 when any or all of 
these conditions are not met. The 
plausibility of the assumptions can 
be ascertained only by recourse to 
further data, and the kind of data 
required will be different for testing 
each assumption. An estimate of in- 
traindividual trait stability, for ex- 
ample, demands retesting the same 
subjects with the same items, but 
such retest data will be of little value 
in arriving at estimates of subject or 
item heterogeneity. 

The one thing these indices of re- 
producibility or single-trial reliabil- 
ity will reflect without equivocation 
is the amount of information thrown 
away by representing the subject's 
performance on the test by a total 
score based on the number of items 
passed. They indicate, in other 
words, how adequately a unidimen- 
sional model fits the obtained data. 

Proponents of homogeneity or re- 
producibility have been criticized be- 
cause their criteria for a “good” test 
are unrealistically strict. It is true 
that perfect reproducibility will oc- 
cur when, and only when: (a) the 
factors determining subjects’ re- 
sponses to the test do not change 
during the testing period, (b) the 
factors determining subjects’ re- 
sponses to the test are the same for 
all subjects, and (c) all the items in 
the test are identical in the factors 
determining the responses they elicit. 
It is also true that perfect single-trial 
reliability will be obtained only under 
the same circumstances. These are 
stringent conditions, and they are 
seldom, if ever, met. Human beings 
are just not that simple, but the fault 
is hardly Guttman's. There is noth- 
ing wrong in continuing to assume 
that many human abilities, attitudes, 
and traits are unidimensional con- 
tinua, but we should be fully aware 
that this is at best a useful first ap- 
proximation, and that an appreciable 
proportion of the information in our 
raw data will thereby be sacrificed on 
the altar of error variance. 


AFFECTIVE PROCESSES IN PERCEPTION¹ 


NOEL JENKIN² 
The Training School at Vineland 


Much imaginative and productive 
enterprise has for several years been 
expended in studying perception as 
it is related to a range of motivational 
functions. Activity in this field has 
almost, but not quite, acquired the 
character of a contemporary “school” 
(20, 79). It is not surprising, there- 
fore, that this movement has become 
an object of critical attack as well as 
a focus for warm adherence. 

Some current evaluations of its 
achievements and limitations are far 
from unanimous. M. D. Vernon (123) 
points out that in many experiments 
the long term schemata of the ob- 
server, by far the most important of 
the nonstimulus determinants of per- 
ception, are given no opportunity to 
function. She also notes that the 
relationship between “temporary 
need state” and perception has not 
been clearly established, and that 
even if the correlation exists, the re- 
sults often may be attributable to a 
short-term cognitive set based on the 
actual conditions of the experiment. 
Such results, therefore, have less im- 
portance and generality than some 
would wish to suppose. Henle (58) 
has stated that the finding of a cor- 
relation between motivational con- 
ditions and performance on a cogni- 
tive task is only the first step toward 
solution of the problem. She finds 
thirteen possible ways in which needs 
and attitudes may influence cognitive 
processes, including perception, and 
calls for research aimed at specifying 
the particular manner by which, in a 
given experiment, a motivational 
state may act upon the percept. 
Murphy (92) has re-emphasized the 
importance of “autism,” and of the 
learning process, in better under- 
standing the nature of perceptual 
dynamics. Prentice (109), who has 
remained disenchanted by the broad 
“functionalistic” definition of per- 
ception, gestures invitingly toward 
the paradigm of the psychophysical 
experiment, and tells once more the 
story of the supposed failure of the 
new movement to explain how cor- 
relations of need and perception are 
mediated. That such correlations 
even exist is doubtful, he feels; they 
are “so hard to demonstrate.” 

¹ The author is indebted to Dr. D. W. 
MacKinnon, of the University of California, 
and to Dr. J. S. Bruner, of Harvard Univer- 
sity, for reading critically the first draft of the 
present manuscript. 

² This paper was prepared when the author 
was at Harvard University. 

Reflection of this nature may well 
leave many a reader with a sense of 
bewilderment concerning the field 
which is the target of such commen- 
tary. Since an ordered consideration 
of the data would probably yield 
clarification, it is the purpose of the 
present paper to organize a summary 
of the principal findings of recent 
years, and thus provide the most 
relevant materials for an assessment 
of the evidence. Two restrictions will 
be placed on the field to be covered. 
First, since the literature prior to 
1949 has already been the subject of 
adequate review (13, 25, 26, 29, 98, 
99), work done prior to this date will 
be omitted or dealt with by brief 
allusion. Second, in view of the 
magnitude of the area to be covered, 
work dealing with the perception of 
other persons, with perceptual typol- 
ogy and with “perceptual attitudes” 
or broad syndromes of related per- 
sonality and cognitive functions, will 
be omitted. It is not thereby implied 
that this work is irrelevant or unim- 
portant, but rather that a survey of 
the latter fields would best be under- 
taken as a separate task. The ma- 
terial reviewed will be grouped into 
four arbitrarily selected categories. 
The first consists of studies in which 
size judgment has constituted the 
dependent variable. The second is a 
group of investigations in which a 
physiological need (hunger or thirst) 
has been studied in relation to some 
perceptual activity. The next area 
to be considered will be the relation 
of positive values to perceptual be- 
havior, followed by a review of work 
on the perception of noxious or 
threatening stimuli. A final section 
will deal briefly with the implications 
of the research previously reviewed, 
in relation to the problem of defining 
“perception.” 


STUDIES OF SIZE JUDGMENT 


Little space is required to discuss 
the well-known Bruner and Goodman 
study (25), or the equally familiar 
counterfire by Carter and Schooler 
(34). The former found an enhance- 
ment of size which could be at- 
tributed to the desire for money, and 
“explained” in terms of perceptual 
accentuation. The latter found no 
such perceptual effect. Further work 
in different laboratories resulted in 
further inconsistencies. Bruner and 
Postman (27) found that tokens con- 
taining a “positive” symbol were 
judged larger than those containing 
a “negative” (unpleasant) one, and 
that both kinds of tokens were 
judged larger than the one containing 
a neutral design. Klein, Schlesinger, 
and Meister (65) in a similar type of 
experiment, failed to find any effect 
on size judgment from the affectively 
stimulating symbols inscribed on the 
stimulus objects. Solley and Lee 
(116) have recently reported further 
data, showing that when the stimulus 
figures are matched for closure, a val- 
ued object (in this case the object 
bearing a dollar sign) is judged sig- 
nificantly larger than neutral objects. 
No significant difference was found 
between judgments of the swastika 
and the neutral figure. 

A different experimental design 
was used by Lambert, Solomon, and 
Watson (69). Judgments by young 
children of a disc, originally neutral 
in significance, showed an apparent 
enhancement and then a diminution 
in size, as the conditions of reinforce- 
ment and extinction were manipu- 
lated by the experimenters. Con- 
tinuing this type of work, Lambert 
and Lambert (70) found similar re- 
sults. Another study by Bruner and 
Postman (24) induced a different 
kind of affective state. While experi- 
encing an electric shock, the Ss gave 
relatively accurate size judgments. 
Immediately after shock, however 
(the Ss now in a state of “tension-
release”), significantly larger judg-
ments were given. 

A new approach to the question of 
value and the perception of size was 
made by Ashley, Harper, and Run- 
yon (4). Fictional life histories of 
“poverty” and “wealth” were in- 
duced in hypnotized Ss, and sig-
nificant results were obtained in the
direction of those reported by Bruner 
and Goodman. An extensive investi- 
gation by Bruner and Rodrigues (30) 
returned to the area originally studied 
by the senior author (25), and at- 
tempted to resolve the differences 
between the Bruner and Goodman 
and the Carter and Schooler results. 
Differences in procedure between the
two studies had involved, first, shape
of the variable patch of light which 
had served to obtain the measure, 
and, second, placement of the coins 
and discs which were to be matched. 
A further possible difference was that 
one group of Ss may have adopted a 
set toward accuracy more strongly 
than the other group. The new ex-
periment by Bruner and Rodrigues
selected variables to represent these
differences in a design intended to
test the hypothesis that the value of 
objects will produce a constant error 
in the judgment of their size. It was 
found that coins were judged sig- 
nificantly larger than equivalent 
cardboard discs, and that there was
no significant difference between the
size estimates of coins and corre-
spondingly-sized metal discs. This
analysis represented the study of
“absolute” accentuation, as in the
earlier studies. A new method of
computing the differential, termed
“relative” accentuation, yielded the
finding that as the value (and size)
of coins is increased, the extent of
overestimation increases significantly
more markedly than is the case with
metal or cardboard discs. Contrary
to the prediction, “accuracy set”
produced overestimation of both
coins and discs, relative to perform-
ance of the group “value set.” This
finding refers to the condition where
the objects to be judged were placed
on the table before S. In contrast,
when coins were held in the hand,
greater relative accentuation was
found for the value-set than for the
accuracy-set condition. No signifi-
cant difference was found between
any of the three types of variable
light patch used.

Lysak and Gilchrist (81) attempted
to test the generality of some of the
Bruner and Goodman findings, using
a design which departs from the
former experiment in two important
respects. First, adult Ss were used, 
and second, paper currency rather 
than coins was employed. A pre- 
liminary experiment established the 
fact that objects of the same size and 
shape as U. S. paper currency are 
progressively overestimated in size as 
a function of complexity of the im- 
pressed design. A second experiment 
found that there was no significant 
difference between the judgments of 
the size of a control rectangle and the 
judgments of one, five, and ten dollar 
bills, and that there was no trend 
toward increasing overestimation of 
the bills as their value increased. A 
group to whom the bills were “avail-
able” (i.e., who were told that they
would be given the money if their 
judgments were accurate) made 
slightly larger judgments than the 
group to whom the bills were un- 
available. Groups which judged the 
bills from memory, five minutes after 
being shown them, made smaller 
judgments than did the groups 
making matches with the bills in 
view. This was contrary to the pre- 
diction based on the Bruner and 
Goodman results. A third experi- 
ment, with a larger group of Ss, con- 
firmed these findings. 

In most of the reported work on 
size estimation in relation to motiva- 
tion, the former measure has been 
achieved by employing the method of 
mean error, or some adaptation of 
this classical technique. Dukes and 
Bevan (42) departed from this typi- 
cal procedure and gave a recognition 
test after previously exposing Ss 
to a gambling situation in which they 
won or lost a sum of money. To a
significant degree, the greater the 
sum of money either won or lost, the 
greater was the size of the object 
chosen as matching the critical object 
which represented the extent of the
winnings or losses. A brief lapse of
time took place between seeing the
object and making the match. A 
further experiment by Dukes and 
Bevan (41) departs not only from the 
method of mean error but also from 
the otherwise uniform concern with 
the visual modality. With a modified 
method of constant stimulus dif- 
ferences, children made a series of 
judgments of weights. “Valued”
weights were constructed by filling 
jars with candy, and “neutral” 
weights by filling jars with sand and 
sawdust. It was found that the 
valued objects were to a significant 
extent judged as heavier than the 
“neutral” objects. 

Yet another technique was used by 
Beams (8) who selected child Ss on 
the basis of their strong preference or 
dislike for certain kinds of food. A 
projected image of the food object 
was adjusted by the S until it ap- 
peared equal in size to the actual food 
object. The stimulus object and the
matching projection were alternately 
monocularly observed. Larger judg- 
ments of the favored type of object 
were found in highly significant 
degree. 

The problem of systematic indi- 
vidual differences in proneness to the 
size-enhancement effect has received 
relatively little attention. A prelimi- 
nary report by Klein (63) gives em- 
phasis to an earlier argument (64) 
about the importance of this area.
The Ss were preselected on the basis
of performance on the Stroop color-
word test, and thus were classified as
“high-interference” and “low-inter-
ference” groups. Thirsty Ss of the
high-interference group, compared
with satiated controls, underesti-
mated the size of discs displaying
thirst-related symbols. Overestima-
tion was shown by the thirsty low-
interference group. For combined
thirst versus combined satiated
groups the mean difference in per- 
formance was negligible. 


Discussion 


A survey of the findings and argu- 
ments, as they have in this area de- 
veloped over the past nine years, 
leads first to the conviction that de-
sign and technique in such experi-
ments have progressively improved.
Perhaps some early faults have been 
replaced by other, and subtler, errors. 
It is noteworthy, however, that the 
weight of evidence favors the propo- 
sition that value and need are de- 
terminants of size judgment. Not 
only do the experiments with positive 
findings outnumber the negative 
ones; it can also be said that most of 
the recent and best-controlled experi- 
ments are among those with a posi- 
tive outcome. 

In an early critique, Pastore (97) 
was able to argue that the relative 
overestimation in the Bruner and 
Goodman experiment may be a func- 
tion of the size of the coin rather than 
of its value. This cannot be said of 
the more recent Bruner and Ro- 
drigues experiment, with their new 
method of calculating relative ac- 
centuation. Nor can it be said of 
various other findings, which em- 
ployed stimuli other than coins. 

Reviewing some of the earlier 
work, with its puzzling and vexatious 
inconsistencies, Bruner and Postman 
(28) were prompted to ask: “What 
kinds of constraints operate in the 
stimulus field which enhance or in- 
hibit the operation of directive fac- 
tors? And what kinds of instructional 
or motivational constraints operate?”
Implied here is the suggestion that 
there were unrecognized variables 
present in the earlier experiments 
which were uncontrolled and which 
led to the inconsistencies between
different studies. The Bruner and
Rodrigues experiment achieved some 
success in defining and holding con- 
stant a greater range of variables. 
Under these conditions it was still 
possible to conclude that the value of 
objects affects their phenomenal ap- 
pearance. Lysak and Gilchrist (81) 
however, holding constant the size 
of the object to be matched, obtained 
clearly negative results, and in at- 
tempting to account for inconsisten- 
cies in this field of investigation, they 
propose a developmental hypothesis. 
The progressive equivocality of ex- 
perimental findings as the age of Ss 
increases suggests that progress to- 
ward maturity brings an increasing 
ability to evaluate the physical en- 
vironment. Nevertheless, several of 
the findings reviewed above have 
shown the “accentuation” effect with 
adult Ss, and hence the developmental 
hypothesis, at least in its simple form, 
does not account for all of the pub- 
lished results. 

Gilchrist and Nesberg (52) ques-
tion the appropriateness of proce-
dures in which the standard and the
variable stimuli have differed in
dimensions other than that in which 
the Ss were required to make their 
matches. The experiment of Beams, 
cited above, attempts to avoid this 
difficulty, and in large measure prob- 
ably succeeds. A new difficulty is 
consequently introduced, however. 
With his method, the stimuli are no 
longer simultaneously present in the 
S's binocular field of vision. The 
brief transition from stimulus object 
to comparison object involves an 
interval of time which, though it is 
small, renders possible the objection 
that it is a memorial rather than a 
perceptual phenomenon which is be- 
ing studied. Whether or not this ob- 
jection is valid, the experiment seems 
certainly to confirm further the view
that accentuation effects as a func-
tion of need and value are shown
most strongly when some degree of
equivocality is present in the stimu-
lus situation. It is possible, as Bruner
points out (21), that optimal viewing
conditions may eliminate the effect
altogether.

Finally, brief reference should be
made to the oft-raised question of
why valued or needed objects or those
associated with pleasant affect should
in some dimension be perceived as
greater than neutral objects. Bruner
and Rodrigues (30) suggest that the
effect is due to the frequent pairing
in the environment of value and
size. An alternative suggestion is
made by Dukes and Bevan (41).
Their previous work (10, 41) had
shown that accentuation effects, in
the case of valued objects, are coupled
with decreased variability of response.
This led them to adopt an analogy
from the field of electronics. Motiva-
tional factors, such as needs and val-
ues, may serve to “tune” the organ-
ism to respond with high selectivity
and amplification (accentuation),
when it is in the presence of valued
stimulus objects. When the receiving
system encounters less valued ob-
jects, the perceiver responds to a
wider range of stimulation, but at
the same time sacrifices the degree of
amplification which occurs with sharp
selectivity.


PHYSIOLOGICAL NEED AND 
PERCEPTION 


The factor of ambiguity in the ex- 
perimental situation has generally 
been minimized in the studies dis- 
cussed above. In this respect, they 
are to be distinguished from practi- 
cally all other experiments in the field 
under review. The contrasting type 
of work has sought to isolate per-
ception-motivation relationships by
means which include either the reduc-
tion of stimulation to some critical
value or the broadening of a range 
of response probabilities, or both. 
Typical of this approach is a group of 
experiments in which either the hun- 
ger or thirst drive has been manipu- 
lated as an independent variable. In- 
fluenced by the work of Sanford (114), 
several of these studies used what has 
been styled as a “projective” tech-
nique (129).

Sanford reported that the number 
of food responses made by hungry 
Ss in several different situations in- 
creases in a negatively accelerated 
manner. This finding provides a con- 
text for the consideration of studies 
more directly and explicitly dealing 
with the perceptual process. The 
same is true of several other studies, 
including an experiment by McClel- 
land and Atkinson (84). These in- 
vestigators also found that responses 
of a food-related character increased
in number as the period of food dep-
rivation lengthened. The “projec-
tive” technique in this experiment
achieved the ultimate in stimulus am- 
biguity; in some of the trials a com- 
pletely blank screen was used and the 
Ss were led to believe that faint vis- 
ual cues were present. In a second 
experiment (5), TAT stories were 
evoked under different degrees of 
hunger motivation. Although a de- 
crease occurred in the frequency of 
references to eating as the interval of 
deprivation lengthened, certain other 
trends were noted. Plots tended to in- 
volve the desire for food or activities 
designed to remove the obstacle in the 
way of hunger satisfaction. 

From these experiments, it is possi- 
ble to draw a conclusion that food- 
deprivation has a marked effect upon 
cognitive processes, possibly includ- 
ing perception. This generalization 
must be qualified, however, by the
negative outcome of the experiment
of Brozek, Guetzkow, and Baldwin 
(18). The hunger need was aroused 
in this experiment more drastically 
than in any other work reported. A 
state of semistarvation was main- 
tained in the Ss for 24 weeks. De- 
spite very clear “clinical” evidence 
that thoughts of food came to pre- 
occupy, and indeed even to obsess the 
minds of the Ss, little or no relation- 
ship was found between the inde- 
pendent variable and the projective- 
test measures employed, including 
the Rorschach and the Rosenzweig 
Picture-Frustration Tests. One re- 
sult, separately considered, was of 
statistical significance. The experi- 
mental group made a higher mean 
percentage of idiosyncratic responses 
per word to eight food words from the 
Kent-Rosanoff list. 

Also within this context of work 
closely related to hunger and percep- 
tion, is the experiment of Postman 
and Crutchfield (105). They re- 
quired Ss to supply missing letters in 
words which offered opportunity for 
completion as food or nonfood words. 
The degree of hunger in the Ss, the 
degree of ambiguity of the words, 
and the degree of selective set for re- 
sponding with food words were sys- 
tematically varied. For the most hun- 
gry of the groups, the increase in food 
responses as a function of degree of 
set was positively accelerated. For 
the nonhungry group the increase 
was negatively accelerated. As dep- 
rivation was prolonged, there was 
a decrease in the number of food re- 
sponses to the least ambiguous stimu- 
lus words and an increase in food re- 
sponses to the most ambiguous words. 
Michaux (90), using a technique simi- 
lar to that of Postman and Crutch- 
field, confirmed his prediction that a 
group of persons with no history of 
mental disorder would show when
hungry an increase of “apperceptive
emphasis on food,” while a group of
schizophrenic patients would, when
hungry, fail to manifest this trend.
The finding is interpreted in terms of
a defect of “psychological homeosta-
sis.”

Work showing or attempting to 
show a relationship between hunger 
and perception, as distinct from re- 
lated cognitive processes, appears to 
have begun with the experiment of 
Levine, Chein, and Murphy (76). 
Ambiguous drawings were presented 
to hungry Ss, who yielded more food- 
related responses than were given by 
a control group. The food-related re- 
sponses to achromatic drawings in- 
creased in number after three hours 
of food deprivation, and increased 
still further after six hours, but de- 
creased after nine hours of depriva- 
tion. For chromatic pictures, the in- 
crease occurred at three hours and the 
decrease at six hours.

Gilchrist and Nesberg (52), in at- 
tempting to secure an unequivocal 
answer to questions about the rela-
tionship of need and perception,
abandoned the “projective” method,
and embarked on a strictly controlled 
series of experiments in which hunger 
and thirst were the independent vari- 
ables. Their Ss for 15 seconds ob- 
served the projected images of food 
and drink objects immediately after 
a meal, and again at 6 hours and at 20 
hours after eating. The light was 
switched off for 10 seconds and the S 
was then required to adjust the 
brightness of the image to the degree 
of illumination previously seen. Hun- 
gry Ss made significantly brighter 
matches than the controls, and this 
effect increased as a function of the 
time of deprivation. A second experi- 
ment, using thirst as the independent 
variable, found similar results. The 
illuminance matches of the experi-
mental group rose steeply with the
period of deprivation, then showed 
some negative acceleration. Since the 
control group showed no significant 
change, the interaction reached a 
high order of significance. Two addi- 
tional experiments confirmed these 
results while controlling for the possi- 
ble influence of stimulus factors ir- 
relevant to the thirst need. From the 
four experiments, it was concluded 
that support is given to the proposi-
tion: “Increasing need gives rise to an
increasingly positive time error in the
illuminance matches of objects rele-
vant to that need.”

A report by Lazarus, Yousem, and 
Arenberg (75) criticizes some earlier 
work in this field for failing to define 
perception in terms of the identifica- 
tion of objective stimuli. Two kinds 
of interrelated perceptions are pro- 
posed, one which seems to be more 
oriented toward imagination, associa- 
tion or projection, and one which is 
more stimulus oriented. In order to
study “perceptual behavior in the
strictest sense,” two experiments
were conducted in which unequivocal
pictures of food and nonfood objects
were shown for one-fifth of a second
at progressively increasing degrees of
illumination. The Ss were free to
guess at the identity of the objects.
Thresholds for food recognition, rela-
tive to thresholds for nonfood recog-
nition, decreased at 2 hours and at
approximately 4 hours after depriva-
tion but increased sharply at 6 hours.
A replication of the experiment gave
similar results. A further experiment
(75) was identical in design, except
that a forced-choice technique was
used with a limited range of alterna- 
tives always before the S. In this sit- 
uation, no significant relationship was 
found between hunger and the per- 
ceptual recognition score. This is in-
terpreted as evidence “that need in
perception relationships depends on a
wide range of response opportuni-
ties.” In both experiments, a study of
prerecognition guesses yielded no 
support for the hypothesis that re- 
sponse availability would account for 
the relationship between hunger and 
perceptual recognition. 

The three perceptual studies dis- 
cussed above have uniformly em- 
ployed stimuli directly representing 
food and drink objects. Their posi- 
tive outcome might prompt the ques- 
tion as to whether the effect of need 
would still manifest if, instead of di- 
rect representation, linguistic sym- 
bols for food and drink were used. A 
positive answer is offered by Wispé 
and Drambarean (129). Two groups 
of Ss deprived of food and water were 
compared with a nondeprived con- 
trol group as to their respective recog-
nition thresholds for “neutral” words
and words related to hunger and 
thirst. In order to study the effect of 
different frequencies of word usage, 
two lists of need-related words were 
separately given prior standardiza-
tion and matched for frequency, one
a list of common words and the other
a list of uncommon words. Lowered
thresholds for the deprived groups
were found to a significant degree, for
both common and uncommon need-
related words. The relationship was
not linear, as shown by the fact that 
the thresholds after a 24-hour interval 
were not lower than those after a 10-
hour interval. A study of the pre-
recognition responses found that
words relating to food objects and to
acts instrumental to need-satisfaction
increased in frequency at 10 hours
and decreased at the 24-hour interval.

A recent experiment by Taylor
(121) along very similar lines found
results which conflict with those out-
lined above. Degree of need and de-
gree of “set” in the Ss were both ma-
nipulated, half of the satiated and half
of the nine-hour deprived groups both 
being given instructions which led 
them to expect words referring to 
food or beverages. Those Ss given
“set” instructions showed lower
thresholds for need-related words 
than did the nonset groups, but there 
was no significant difference between 
the thresholds of the deprived and 
satiated groups, even for the subjects 
not given “set” instructions. A repli-
cation of the experiment with a dif-
ferent ordering of the stimuli showed 
the same negative results. 


Discussion 


Of the six reports upon hunger in 
relation to a broadly defined cogni- 
tive area, only one (18) was essentially 
negative, and this was the one which 
used stimuli largely irrelevant to the 
need-state studied. It seems from the 
empirical findings that when appro- 
priate stimulus objects are used, the 
presence of the hunger need facili-
tates categorization in a manner con-
sistent with that need. The same
findings compel us to note that “ap-
propriate stimulus objects” consti-
tute a large and varied class, ranging
from incomplete words to a blank
screen. Further, we must incongru-
ously exclude from this class the
Rorschach blots and the Rosenzweig
Picture-Frustration Test (cf. Brozek,
et al., 18). Some measure of congru-
ity of the stimulus object with the
need is evidently one factor which
produces the effect. On the other
hand, a very high degree of ambigu-
ity (e.g., a blank screen plus sugges-
tions that cues are present) is also a
favorable condition for the occur-
rence of need-related responses.

The crucial question for the present
discussion must ask if the relationship
between need-state and such proc-
esses as imagery, association, and
problem solving extends also to the
perceptual process itself. From the 
results of four out of the five percep- 
tual studies reviewed, an affirmative 
answer is indicated. The earliest of 
these (76), has been several times 
criticized on methodological grounds 
(2, 75, 97), but even with the exclu- 
sion of this instance, the weight of 
evidence seems to favor the hypothe- 
sis. Of the remaining studies, two 
(121, 129) used stimuli which were 
only indirectly representative of the 
goal objects (i.e., words rather than 
pictures) and their results are in con- 
flict. Two studies remain (52, 75) 
which offer clear evidence for a rela- 
tionship between need-state and per- 
ception. The principal independent 
variables in both groups of experi- 
ments were similar, yet they used en- 
tirely different measures of perform- 
ance—recognition thresholds for pic- 
tures versus illuminance matches of 
pictures. Considered in the context
of research in the same general area,
including the work on size estima-
tion, these results are therefore con-
vincing.

As in the previous section, it is 
necessary to ask why need variables 
should apparently function as de- 
terminants of perception. A feature 
of possible significance in the solu- 
tion of this problem lies in the shape 
of the function graphically plotted 
between degree of deprivation and 
the dependent variable. Despite 
widely differing kinds of measures, 
all experiments having a positive out- 
come in this type of perceptual and 
perceptual-cognitive research have 
shown similar features. As hunger 
has increased, the plot has shown 
either a U curve or one of negative 
acceleration. This effect is at mini- 
mum, though still discernible, in the 
Gilchrist and Nesberg (52) experi-
ments and at a maximum in the
Levine et al. (76) and Lazarus et al.
(75) studies. In the Postman and
Crutchfield study (105), it is true
only for the “low-probability” list,
i.e., words less likely to evoke “food”
solutions. 

Relevant to this common finding 
is the hypothesis of McClelland (83) 
who proposes, in terms of personality 
theory, an explanation of the curvi- 
linear relationship. Motivation is 
presumed to have different effects at 
different intensity levels. When it is 
weak, in the “wish-fulfillment” stage,
goal images occur. As it increases, a
“push toward reality” is experienced,
in which deprivation imagery tends to
replace goal imagery. Still further in-
crease in motivation brings orienta-
tion toward relief from anxiety rather
than toward attainment of the origi-
nal goal, and in this stage occurs “a
kind of defensive goal imagery which 
is very different in its function from 
the goal imagery obtained with weak 
motivation.” 

An alternative explanation to that 
of McClelland is offered by Lazarus 
et al. (75), who follow a similar view 
proposed earlier by Sanford (114). 
Since sensitivity to the food objects 
has been shown to increase after 
about 3 or 4 hours of deprivation 
and then to level off or decrease, it is 
suggested that the perceptual curve 
follows the cyclical food habit. Sup- 
port for this hypothesis would re- 
quire evidence of an increase in sensi- 
tivity after the initial rise and fall. 
Experimenters (e.g., 52, 129) who 
have used longer periods of depriva- 
tion than did Lazarus et al. have not 
as yet reported any such recurrent 
rise in the function measured. 

Wispé (128) suggests that the in- 
crease in need-related associations oc-
curs initially as a result of a “food
habit” rather than real tissue need.
It is implied that the influence of the
“food habit” declines as the period
of deprivation is protracted, leaving, 
however, the state of physiological 
need to operate as the determinant of 
a higher than normal level of respond- 
ing with food associations. 

Much of the interest of the experi- 
ments on hunger and thirst in relation 
to perception lies in the possibility 
that changes in body chemistry are 
closely related to perceptual experi- 
ence (7). Test of a two-component 
theory, such as that proposed above, 
would require experimental separa- 
tion of the habit or appetite variable 
from the physiological need. The 
work of Beams (8), cited in the previ-
ous section, suggests a possible method
for inclusion of the former variable in 
a multifactor design intended to an- 
alyze the complex functions which are 
evidently operating. 

Until such work is conducted, the 
experimental data dealing with re- 
duced thresholds for need-related ob- 
jects and symbols are best inter- 
preted, it seems, in a way which is 
probably inadequate and which may 
be only partially true. This postu- 
lates a learned association between 
need-state and need-related responses, 
with the consequence that the latter 
have an increased probability and 
facility of occurrence under appropri- 
ate drive conditions and in an appro- 
priate stimulus situation. Such an 
explanation does not readily explain 
the Gilchrist and Nesberg (52) find- 
ing of a positive time error for illumi- 
nance matches of need-relevant ob-
jects. This type of perceptual “ac-
centuation” is more akin to the kind
of data considered under the heading 
“studies of size judgment” and might 
best be understood in the context of 
those experiments. 


POSITIVE VALUES AND PERCEPTION


The stimulus for a great deal of 
subsequent research was a paper by 
Postman, Bruner, and McGinnies 
(101), published in 1948. Entitled 
“Personal Values as Selective Factors 
in Perception,” this reported an in-
vestigation prompted by the ques-
tion: what does the individual con- 
tribute to perceptual selection over 
and above a healthy pair of eyes and 
the appropriate response mecha- 
nisms? Measures on the Allport- 
Vernon Study of Values were com- 
pared with Ss’ recognition thresh- 
olds for tachistoscopically exposed 
words which represented each of the 
value areas. The results were dis- 
cussed in terms of concepts de- 
veloped by Bruner and Postman (23) 
in a previous study: selective sensi-
tization and perceptual defense. To
these was added a new concept, that
of value resonance, based on a study
of the Ss’ presolution hypotheses.

A special feature of the Postman, 
Bruner, and McGinnies experiment 
lay in the fact that the main inde- 
pendent variable was not a general- 
ized motivational state, assumed to 
be uniform for all Ss. Instead, a cer- 
tain range of individual differences 
(the Spranger values) was selected 
and the measures on the dependent 
variable (recognition of value words) 
were treated with respect to the or- 
dering of each individual S's value 
profile. This enterprising step opened 
up a new path for research, in which 
the focus upon perception was di- 
rected through the lens of personal- 
ity, rather than through the broadly 
defined motivational state. 

This innovation was adopted by 
subsequent investigators, among the 
first of whom were McGinnies and 
Bowles (87), who used as stimuli por-
traits rather than words. A value was
attached to each face by telling the S
in each case that: “This is a scientist
(or artist, minister, etc.).” The occu-
pations thus denoted represented
each of the Spranger values. The S’s 
score was the number of exposures 
necessary for him correctly to identify 
each of the faces. Correlation coef- 
ficients were computed for each indi- 
vidual S between his scores for iden- 
tifying each occupational representa- 
tive and his scores on the Allport- 
Vernon Study of Values. Negative 
correlations were obtained for 15 of 
the Ss; the remaining 9 were positive. 
A further analysis showed a close 
rank-order agreement between value 
rank and the total number of correct 
identifications on the first recognition 
trial. It was concluded that when the 
experimental design does not offer 
opportunity for reduced thresholds 
for valued stimuli, “selective sensiti- 
zation” manifests itself in greater 
ease of fixating visual symbols of pre- 
ferred values. Although the ma- 
jority of Ss learned more easily to 
recognize valued symbols, it was felt 
to be of some significance that a 
smaller group found this task more 
difficult. A parallel was seen in the
similar findings of previous writers
(23, 101) who interpreted special sen-
sitivity to less valued symbols as
“selective vigilance,” the antithesis
of “perceptual defense.”

Vanderplas and Blake (122), seek-
ing to extend the general validity of
the concept of perceptual sensitiza-
tion, designed an experiment which
varied from the Postman, Bruner,
and McGinnies study in that the
auditory rather than the visual mo-
dality was employed. The authors
were able to demonstrate that audi-
tory perceptual sensitization oper-
ates differentially to raise or lower
recognition thresholds in a manner
consonant with individual values as
defined and measured by an inde-
pendent instrument. A small minor- 
ity of the Ss again showed the oppo- 
site trend, i.e., ‘vigilance’ for the 
words representing low-ranking val- 
ues. 

The conclusion reached by Post- 
man, Bruner, and McGinnies was 
questioned by Solomon and Howes 
(61, 117) who reinterpreted their re- 
sults on the basis of a word-frequency 
hypothesis. It was assumed first 
that “high valuation of a given area 
of interest is associated with a posi- 
tive deviation from the mean fre- 
quencies with which words in that 
area occur in general usage” (117).
Second, it was proposed that the 
Allport-Vernon test itself can be con- 
sidered a measure of the frequency 
with which the S uses certain words, 
and that it is unnecessary to postu-
late entities such as “values” in order
to account for an S’s profile. The ex- 
periment reported by Solomon and 
Howes employed words of two cate- 
gories, relatively frequent and rela- 
tively infrequent, according to the 
Thorndike-Lorge count. For each 
value rank, the data showed lower 
recognition thresholds for the fre- 
quent words than for the infrequent 
words. Between value rank on the 
Study of Values and recognition 
thresholds, a statistically nonsignifi- 
cant trend was found in the direction 
reported by the former writers (101). 
These results were regarded as evi- 
dence consistent with the proposi- 
tions cited above. 

Postman and Schneider (104), 
after communication with Solomon 
and Howes, published, simultane-
ously with the latter authors, a
further report on the same type of 
data. Their stimulus words were also 
grouped into categories of relatively 
high and relatively low frequencies of 
occurrence. Two differences from the 
Solomon and Howes lists may be dis-
tinguished. First, the words were
classified as falling into the value
areas by means of a consensus of
three judges “thoroughly familiar
with the test.” Second, the “un-
familiar” words chosen were consid-
erably less unfamiliar than those of
Solomon and Howes. Hence, while
the latter experimenters chose words
such as “percipience,” “erudition,”
“uncoerced,” and “vignette,” Post-
man and Schneider chose for the un-
familiar list examples like “concep-
tion,” “logic,” “dominant,” and “lit-
erature.” In general, the results of
the experiment confirmed the finding
that high-frequency words are recog-
nized more easily than low-frequency
words and that for high-frequency
words there is no systematic relation-
ship between value rank on the All-
port-Vernon test and recognition
thresholds. For low-frequency words,
however, the relationship with value
rank was present and statistically sig-
nificant, the trend again being in the
direction found by Postman, Bruner,
and McGinnies.
What seems to be an effective re- 
buttal of the Solomon and Howes ar- 
gument was presented by Adams and 
Brown (1) and Brown and Adams 
(17). In the latter paper, an experi- 
ment is reported which tests the Solo- 
mon and Howes hypothesis that re- 
sults from the Allport-Vernon test 
are accountable for in terms of word 
frequency. The test was revised in 
such a way that the alternatives in all 
areas except one were expressed in 
synonyms with a very low frequency 
of usage. Six forms were constructed,
each one favoring frequency-wise a
different value area. On the Solomon
and Howes hypothesis, the S should
choose the high-frequency words and
thus achieve a high rank for the cor-
responding “value.” Six groups of Ss
answered one form of the new test,
and also the Allport-Vernon-Lindzey
test. It was found that there were no
consistent changes in scores on the
value area emphasized for frequency,
relative to the five other groups in
the same value area of the new test.
Further, correlations of the new test
with the Allport-Vernon-Lindzey scale
remained significantly positive, not-
withstanding the changes in fre-
quency in the six versions of the
former. It is concluded that the re-
sults disprove the Solomon and
Howes hypothesis and are consistent
with “the postulation of a central
cognitive affective construct which
may be called ‘value area.’”

Haigh and Fiske (56), with an im-
proved statistical procedure, have
also repeated the Postman, Bruner,
and McGinnies study (101). They
style the use of the Allport-Vernon
test by the previous authors an “in-
direct” measure of value preference,
and supplement it in their own work
by a “direct” measure, which con-
sists of a ranking by each S of the 36
words shown to him tachistoscopi-
cally. The rank order was obtained
within four weeks after conducting
the perceptual experiment. The re-
sults obtained by the previous experi-
menters were corroborated by use of
the “indirect” method. Use of the
new “direct” method also gave con-
firmation but with a higher level of
statistical significance. It is con-
cluded that positive values tend to be
associated with shorter recognition
times.

Two criticisms of the Postman,
Bruner, and McGinnies position were
made by Mausner and Siegel (89).
One was the word-familiarity argu-
ment, discussed earlier in this sec-
tion, and the second was based on the
view that the Allport-Vernon instru-
ment is an inadequate test of values.
Seeking a situation in which the fac-
tor of familiarity was controlled and
wherein value was varied in a sim- 
ple manner, these authors designed 
an experiment in which adolescent 
stamp collectors were induced to learn 
the monetary “values” (i.e., alleged 
worth according to a firm of stamp 
merchants) of the various members of 
a set of stamps. No significant rela- 
tionship was found between recogni- 
tion thresholds for the stamps and 
their respective “values.” The results
were interpreted as evidence failing to 
support the Postman, Bruner, and 
McGinnies hypothesis. It should be 
noted, however, that the term “value” 
was in this experiment employed in a
sense different from the Allport-Ver- 
non usage, and that learning scores 
were not reported. Inspection of the 
data shows a distinct, though statis- 
tically nonsignificant trend in the 
direction of lower thresholds for 
higher-valued stamps. Use of addi- 
tional controls and a more sensitive 
statistical test might well have re- 
sulted in the hypothesis’ being sup- 
ported. 

A method different from that of 
the studies discussed above was de- 
vised by McClelland and Liberman 
(85). On the basis of combined TAT
and performance-task measures of
n achievement, Ss were grouped into
high, middle, and low categories.
Three months after the personality
measures had been secured, the Ss
were presented tachistoscopically
with verbal material. Relative to the
threshold for neutral words, the
high n-achievement group perceived
achievement-related words signifi-
cantly more easily than did the mid-
dle and low groups. Security-related
words were perceived significantly
more easily by the middle and high
groups, as compared with the per-
formance of the low group on n
achievement. The authors comment
that the average familiarity ranking 
of the words employed as stimuli was 
sufficiently close to make highly im- 
plausible any interpretation of the re- 
sults in terms of differential famil- 
iarity. A parallel type of study by 
Lindner (77) showed that “sensitiza- 
tion” for ambiguous, sexually-sug- 
gestive picture material was greater 
for a group of sexual offenders than 
for a control group of nonsexual of- 
fenders. Though the responses were 
not of a socially approved character, 
they presumably reflected the posi-
tive “values” of this special group.

Results difficult to interpret in the 
present context are reported by Gil- 
christ, Ludeman, and Lysak (53). 
Groups of students representing the 
extremes of a distribution on an anti- 
Semitism scale were used as Ss. Posi- 
tively valued, negatively valued, and 
neutral words were used as stimuli,
each of which appeared on two slides,
one containing the word “ink” above
and below the stimulus word, and the
other the word “Jew.” A context was
thus provided, but the S was asked 
to report only the stimulus word. It 
was found that both positive and 
negative values lowered word-recog- 
nition thresholds in comparison with 
neutral value, and also that emotion- 
ally loaded context has the effect of 
raising the thresholds for both posi- 
tively and negatively valued words, 
while lowering the thresholds for 
neutral words. These results cannot 
be explained in terms of the word- 
frequency hypothesis, since the posi- 
tively and negatively valued words 
were matched for frequency and the 
neutral words were actually chosen 
from a higher frequency category.
The authors point out that these re-
sults pose problems both for the con-
cept of “response suppression” and
that of “perceptual defense.” The
same is true of the concept of selec-
tive sensitization, unless this be
broadened to refer to negatively val- 
ued, as well as positively valued stim- 
uli. 


Discussion 


An overview of the work on selec- 
tive sensitization leads compellingly 
to the conclusion that the concept is 
still a valid and useful one, and that 
the phenomena described by this 
term are not artifacts of word fre-
quency or any other spurious variable
yet distinguished. What remain to
be clarified are the conditions under
which sensitization occurs. A start
has been made in this direction (104,
117), enabling us now to say with
some assurance that preferred per-
sonal value and moderate unfamiliar-
ity of relevant words are conditions
which produce the effect, when these
words are exposed in isolation. One
experiment (53) has indicated that
when the stimuli are simultaneously
presented with a context of other
words, sensitization occurs for words
of negative as well as positive value.
This enigmatic result is inconsistent
with the bulk of evidence. It resists
any plausible interpretation and
clearly calls for further study.

One possible direction for future re- 
search is pointed out by the instances 
(6, 101, 122) in which some Ss have 
functioned in a manner opposite to 
that predicted. The treatment of 
such data in terms of “selective 
vigilance” by the same writers who 
proposed the concept of “perceptual 
defense” has been criticized as an in- 
consistency (2, 62). Postman (100) 
has replied, showing that these con- 
cepts were not invoked as explana- 
tory principles and that the situation 
contains no more inconsistency than 
the parallel one in learning theory, 
where antithetical principles of facili-
tation and inhibition are found use-
ful. If it is a fact that some indi-
viduals consistently tend to show
“sensitization” for positively valued
and others tend to show “vigilance”
for negatively valued stimuli, and if
means can be devised to predict in
advance which individuals will be-
have in these respective ways, the
implications will be important. A
possible personality dimension is
herein suggested, which may provide
a basis for more effective control in
value and perception research.
Though such a control is hypothetical
at present, it can be seen that if it
should be developed and applied,
there may then be avoided the kind
of ambiguity inherent in the data and
conclusions of Gilchrist, Ludeman,
and Lysak (53). On the other hand,
the so-called “vigilance” may be a
function of one or several unrecog-
nized variables present in the experi-
mental situation. In either case, the
facts call for further elucidation.

A surprising feature of the area re-
viewed above is that with few excep-
tions (53, 77, 85), all investigators us-
ing personality variables have chosen
the one instrument, the Allport-Vernon
Study of Values. The same logic
which prompted the Postman, Bru-
ner, and McGinnies experiment could
equally well have led to the selection
of dominance-submission, extraver-
sion-introversion, egocentricity-al-
truism, or a host of other attributes
of personality. In short, what Mc-
Clelland and Liberman have at-
tempted with n achievement remains
to be done with other variables also.
One probable reason why the field 
has not been further explored is the 
feeling that such correlates of percep- 
tion, in and of themselves, are of 
little value. As Bruner (20) remarks, 
such data serve not to explain percep-
tion but to indicate problems. One
such problem has been raised by the
correlation between the Study of
Values and certain recognition
thresholds. Essentially, it is the
question of a mediating mechanism
between the trait (or “value”) and
the percept. What is now needed is
close study of a variety of conditions
under which the correlation appears
and fails to appear. The Study of
Values is not necessarily the most
sensitive or reliable instrument for
such a search. Thus there is room for
a great deal more exploratory work in
the area considered. When the most
satisfactory personality measure has
been determined, the way will be
clear for an intensive study of the fac-
tors responsible for the correlation.


REACTIONS TO NOXIOUS STIMULI


The area of personal values, just 
considered, has been somewhat arbi-
trarily separated from the much
more extensive line of work which
has been concerned primarily with
the perception of noxious, “inimical,”
threatening, or “taboo” material.
The finding of raised thresholds in
relation to such stimuli, as compared
with thresholds for “neutral” ob-
jects, was a phenomenon to which
Bruner and Postman in 1947 at-
tached the term “perceptual de-
fense” (23). This and subsequent
work by these writers (19, 101), and
also by McGinnies (86), established
this concept in the literature, aroused
unusual interest and led to a surpris-
ing amount of debate and dispute,
some of which tended to be acrimoni-
ous.

The concept of perceptual defense 
was not proposed by the original 
authors as an “explanatory” princi- 
ple, and it was made clear that work 
was needed to uncover the mediating 
mechanisms involved (28, 101). Not-
withstanding this caution in present-
ing the idea, some psychologists re-
acted with concern to the possibility
that an “homunculus” process was 
implied (e.g., Howie, 62). It is clear 
that this assumption was not justi- 
fied. A more serious objection, how- 
ever, was made by Solomon and 
Howes (117), who claimed that two 
simple processes account for the data 
of these experiments. The first is the 
frequency hypothesis discussed in the 
preceding section, and the second is 
the view that responses to taboo
words are not delayed in perceptual
recognition but merely delayed in
verbal report. Other writers (e.g.,
Whittaker, Gilchrist, and Fischer,
126) have added weight to one or
both of these arguments and have
also proposed explanations in terms
of selectively reporting sets.

Postman, Bronson, and Gropper
(107) posed the question in a general
way: Can perceptual defense be re-
duced to the operation of determi-
nants which are not specifically emo-
tional? An answer was sought by de-
signing an experiment in which taboo
and neutral words were matched for 
frequency, and four different sets of 
instructions were used in order to 
manipulate the Ss’ readiness to re- 
port taboo words. A further attempt 
to vary the factor of selective verbal 
report was made by systematically 
varying the sex of E and the sex of S. 
Under all conditions, the thresholds 
for taboo words were found to be 
lower than for the neutral control 
words. This was thought to be due 
to a systematic underestimation of 
the familiarity of the taboo words.
Relative thresholds for the two types 
of words varied significantly with the 
nature of the instructions, in the di- 
rection of the naive group having 
higher thresholds than any of the 
groups forewarned to expect taboo 
words. On the basis of this finding 
and from a review of former studies,
it was concluded that “perceptual de-
fense has, at best, the status of an un-
confirmed hypothesis.” A similar ex-
periment by Lacy, Lewinger, and
Adamson (68) showed that the factor 
of expectation acted to reduce rec- 
ognition thresholds more rapidly for 
taboo words than for neutral words, 
and also that an habituation effect 
rapidly leveled the threshold differ- 
ences which occurred early in the se- 
ries between the two classes of words. 

In an experiment similar to that of 
Postman, Bronson, and Gropper, it 
was shown by Freeman (50) that 
when Ss are set to look for and report 
taboo words, their thresholds for 
these words are not higher than for 
neutral words. Continuing this 
study, Freeman (51) measured recog- 
nition thresholds of separate groups 
of male and female Ss. Neutral and
taboo words were again used, and
half of the Ss in each sex group were
informed that the stimulus list would
contain some taboo words. Typically,
raised thresholds for taboo words
were found in the uninformed group
and little mean difference between
the thresholds for the two classes of
words in the informed group. Sex of
the Ss played a significant role, in- 
formed females showing less reduc- 
tion of the taboo word threshold than 
did informed males. A further experi- 
ment (51) was identical in design but 
used for the experimental group “ego 
involving” instructions which led Ss 
to believe that the perceptual task 
was related to academic success and 
aptitude. Ego involvement had the 
effect of reducing thresholds for all 
words (neutral as well as taboo) and 
this effect was much more pronounced 
for females than for males. In place 
of the “perceptual defense” interpre-
tation, Freeman (50) proposes that
raised thresholds occur “as a function
of the dominance of alternative sets
which do not predispose S toward the 
perception of taboo material.” 
Several experiments have been re- 
ported which are claimed to demon- 
strate perceptual defense while con- 
trolling for the factors revealed in the 
previous group of studies. Repre- 
sentative of such work is a series of 
studies by Cowen and Beier. Their 
first experiment (38) required a 
group of Ss to examine a booklet in 
which decreasingly blurred versions 
of a single word appeared as the 
pages were turned. Several such 
booklets provided stimulus material 
consisting of threat and nonthreat 
words. A second group followed the 
same procedure, except that it was 
“alerted” to a threat experience by 
prior exposure to the words it would 
subsequently decipher. More time 
and trials were required to report the 
threat words than the nonthreat 
words under both alerted and non- 
alerted conditions, although the dif- 
ference was significantly greater for 
the latter condition. Both group and
individual variability in responding
to threat words increased under
alerted conditions. A subsequent ex-
periment (9) using the same technique
confirmed at a higher level of signifi-
cance the finding that Ss “though
alerted to possible threat, neverthe-
less respond less accurately and less
promptly to threat words than to
neutral ones.” A third experiment
(39) again confirmed this finding
while controlling specifically for the
variable of word frequency and also
for social setting, insofar as the latter
involved sex roles. The writers con-
sider that an explanation of their re-
sults in terms of conscious inhibition
would be inadequate, and that word
frequency must be excluded from any
interpretation of the findings.

Another approach to the problem
was made by Newton (95) who
equated two lists of words for fre-
quency, and found that under condi-
tions of tachistoscopic exposure, sig-
nificantly fewer errors of recognition
were made of the pleasant than the
unpleasant words. Since the latter
were not of the “taboo” variety, it
was felt that the probability of re-
sponse-suppression was reduced to a
minimum. Wiener (127) controlled
for frequency by using the identical
stimulus words as “threat” and “non-
threat” stimuli.
This was achieved 
by embedding the words in contexts 
which supplied different meanings 
for the different groups. Selective 
set was controlled by the use of 
“neutral” stimulus words in addi- 
tion to the critical words in a neutral 
context. The “threat’’ group re- 
quired significantly fewer trials than 
the “neutral” group to report the 
critical words correctly. The experi-
menter claims that while the direc-
tion of the difference is opposite to
that shown by much other work on
perceptual defense, this evidence is
clearly in favor of motivation as a
determinant of perception. A subse-
quent experiment (33) clarified the
former finding by distinguishing two
groups of Ss on the basis of clinical
criteria, and predicting that they
would show either sensitization or de-
fense in the perceptual situation.
Those classified as “sex sensitizers”
perceived sexual words with signifi-
cantly fewer trials than did those
classified as “sex repressers.” A simi-
lar finding was reported for those
classified in regard to hostility. A
third dichotomized group, distin-
guished in terms of consciousness of
personal adequacy, differed in the
predicted direction without reaching
the criterion for statistical signifi-
cance.

Eriksen (46) has also defended the
notion of perceptual defense, though
he justly criticizes the methodologi-
cal errors of the “dirty word” pro-
cedure. His own technique has been
to study the phenomenon in relation 
to individual differences as distin- 
guished by clinical methods. In this 
he supports the argument of Lazarus 
(72) who points out in reply to Post- 
man, Bronson, and Gropper that re- 
pression is not the only mechanism of 
defense and that not all persons will 
deal with threat in the same way. 
Eriksen (44), with a heterogeneous
sample of psychiatric patients, found 
a linear relationship between recogni- 
tion thresholds for pictures represent- 
ing aggressive behavior, and ratings 
of TAT stories for n aggression. Sen- 
sitization for the aggressive pictures 
corresponded with the expression of 
aggressive content in the stories, and 
perceptual defense (high thresholds) 
was coupled with minimal aggressive 
content, blocking, and incoherent and 
unelaborated stories when the cards 
were suggestive of aggressive inter- 
pretation. In another study, the 
same experimenter (43) found that 
disturbance in associating to aggres- 
sive, succorant, and homosexual words 
was positively related to recognition 
thresholds for scenes depicting people 
in the act of expressing or gratifying 
the corresponding needs. Evidence 
of a similar kind has been presented 
by Lazarus, Eriksen, and Fonda (74) 
who distinguished patients using
“intellectualizing” mechanisms from
those using “repressive” mechanisms,
and found that the former perceived
material significantly more accurately
than did the latter. A further experi-
ment by Eriksen (45) used groups of
Ss scoring at the extremes of a meas-
ure testing for their recall of com-
pleted and incompleted tasks. Per-
ceptual defense was found only in Ss
who had previously shown an avoid-
ance type of defense in the memory
test. This resembles the Postman 
and Solomon (103) finding in which 
some Ss showed relatively high 
thresholds for words which are associ-
ated with a failure experience, and
others showed relatively low thresh- 
olds for the same words associated 
with success experiences. As Eriksen 
(46) points out, an explanation in 
terms of degrees of familiarity with 
different words is not appropriate to 
the data of these two experiments. 
Furthermore, the response-suppres- 
sion argument (126) is met by them 
also, since in these experiments the 
perceptual stimuli were free of social 
taboos. A recent experiment by 
Eriksen and Browne (49) used groups 
of Ss respectively high and low on the 
psychasthenia scale of the MMPI. 
After an experimentally produced 
failure experience, involving expo- 
sure to a list of words, recognition 
thresholds for the failure-related 
words and a control series of neutral 
words were measured. A reduced
threshold for the failure words, rela- 
tive to the neutral words, was found, 
but there was less reduction for the 
low psychasthenia group. The signifi-
cant interaction was regarded as evi-
dence for perceptual defense, which is
interpreted in terms of principles de- 
rived from punishment and avoidance 
conditioning. 

The finding of McGinnies (86) that 
measurable autonomic reactions to 
emotionally loaded stimulus material 
occur at subthreshold exposures was
challenged by Aronfreed, Messick,
and Diggory (3) who found an in-
crease in GSR at the stage of recogni-
tion of unpleasant, as contrasted with
neutral and pleasant, words. There
was no significant difference between
thresholds for the latter two classes
of words, and hence some evidence is
provided for the notion of perceptual
defense, but none for perceptual sen-
sitization. In contrast with the find-
ings of some other experimenters
(102, 107) there were no significant 
differences between the mean thresh-
olds of “informed” groups (“set” for
the type of stimulus to be presented) 
and “uninformed” groups. Good- 
stein (55) has pointed out that Aron- 
freed, Messick, and Diggory failed 
to equate their stimulus words for 
frequency, and that this variable 
could have determined the results. 
Using picture material of aggressive 
and neutral content, Stein (119) 
studied the responses of a sample of 
neurotic patients. It was found that 
these could be classified as either “de-
fenders” or “sensitizers,” depending
on whether their thresholds for the 
aggressive material were respectively 
above or below their thresholds for 
the neutral material. Subsequent 
tests established the reliability of the 
measures and demonstrated the con- 
sistency of these patients in adopting 
one or other of these types of mech- 
anism. 

Numerous experiments over the 
past two years have reported contra- 
dictory results and have reflected dif- 
fering theoretical predilections. De 
Lucia and Stagner (40) report that 
word-recognition time is clearly af- 
fected by two sets of determinants: 
frequency of usage and emotion-
arousing value. It is suggested that
future work could usefully aim at re-
lating each of these more effectively
to specific personalities. Reece (111)
obtains results enabling him to con-
clude that deductions based upon the
principles of reward learning theory
can effectively predict differences in
visual recognition thresholds. Kur-
land (67), using auditory presentation
of emotional words, failed to find any
difference between the recognition
thresholds of obsessive-compulsive
and hysteria patients respectively, a
result which would not be predicted
from Eriksen’s (45) hypothesis. Kur-
land also found that the combined
patient groups perceived the emo- 
tional words at significantly lower 
thresholds than did the normal Ss, a 
further discovery inconsistent with 
much other work. 

Another challenging finding is that 
of Bitterman and Kniffin (11) who 
tested undergraduate women for 
their recognition of neutral and taboo 
words, and who also administered to 
the Ss the MMPI and the Taylor 
Manifest Anxiety Scale. No signifi- 
cant relation between anxiety level 
and recognition threshold was found. 
The difference between thresholds for 
neutral and taboo words was signifi- 
cant, but this difference correlated 
positively with the Pd score of the 
MMPI, and was unrelated to K, Hy, 
Sc, or Anxiety. Concluding that the 
differences in threshold can be better 
understood in terms of differential 
readiness to report rather than in 
terms of perceptual distortion, the 
authors question an interpretation by 
McGinnies and Sherman (88) of this 
kind of data. The latter writers as- 
sumed that taboo words serve as sig- 
nals of punishment, and therefore 
become cues for eliciting anxiety 
which engenders avoidance reactions. 
This situation elevates recognition 
thresholds and may persist long 
enough to interfere with the percep- 
tion of subsequent neutral words. 
Chodorkoff (36) has shown that this 
interpretation is not assailed by the 
evidence of Bitterman and Kniffin, 
since the latter writers measured the 
general level of anxiety of the Ss, 
rather than the crucial variable, i.e., 
the degree of anxiety which each 
word is able to elicit. His own work 
(35) supports the view that high- and 
low-anxiety groups would show dif-
ferences in recognition thresholds
providing that personally relevant 
stimuli are selected for each S. 
Further work by Chodorkoff (37) 
has led to a revised theoretical posi- 
tion which takes account of new evi- 
dence. Word-association time (as- 
sumed to measure degree of threat) 
is unrelated to individual defensive- 
ness as measured by the difference 
between recognition thresholds for 
neutral and threatening words, but is 
related to the absolute value of the 
perceptual measure; i.e., the degree 
of deviation of the threshold for the 
critical words, either above or below 
that for the neutral words, regardless 
of sign. Implying that the word- 
association measure is related to the 
extent of the perceptual reaction but 
not to its direction, the data provide 
further evidence consistent with the 
view that there are individual differ-
ences in the choice of either “vigi-
lant” or “defensive” reactions to
threatening stimulation. 
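
The distinction drawn here between the direction and the extent of the perceptual reaction can be made concrete with a small sketch. The Python fragment below is illustrative only: the function name, the use of mean thresholds, the millisecond values, and the labels are assumptions introduced for this example, not details reported by Chodorkoff (37).

```python
from statistics import mean


def perceptual_reaction(critical_ms, neutral_ms):
    """Summarize one subject's reaction to critical (threatening) words.

    Both arguments are lists of recognition thresholds in milliseconds
    (hypothetical units chosen for this sketch).
    Returns (signed difference, magnitude, direction label).
    """
    # Signed difference: positive when critical words need longer exposures.
    diff = mean(critical_ms) - mean(neutral_ms)
    if diff > 0:
        direction = "defensive"   # raised threshold for critical words
    elif diff < 0:
        direction = "vigilant"    # lowered threshold for critical words
    else:
        direction = "none"
    # Magnitude ignores the sign: extent of deviation in either direction.
    return diff, abs(diff), direction


# Two hypothetical subjects with opposite directions but equal extent.
subject_a = perceptual_reaction([250, 350], [200, 200])
subject_b = perceptual_reaction([100, 100], [150, 250])
print(subject_a)
print(subject_b)
```

On this reading, the signed difference separates the two hypothetical subjects, whereas the magnitude treats them alike; it is the latter quantity that the word-association measure is said to track.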

Osler and Lewinsohn (96), using a 
carefully controlled stimulus-match- 
ing procedure, found that the thresh- 
olds for unacceptable words were 
lower than for acceptable words, a 
result which runs counter to the ma- 
jority of previous findings. The data 
are interpreted as implying that anx- 
iety is associated with greater ‘‘vigi- 
lance.”” Neel (93) showed tachisto- 
scopically, to various groups of female 


Ss, pictures of persons engaged in 


various sexual and aggressive ac- 
tivities, and used a multiple-choice 
situation toobtain responses. Women 
judged as lacking conflict in the areas 
of sex and aggression showed ‘‘vigi- 
lance”’ to stimuli related to mild sex- 
ual behavior, and also revealed “re- 
pression”’ in response to stimuli re- 
lated to directly sexual situations. 
There was also in this group “‘sensi- 
tivity’’ to directly hostile situations 
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and ‘‘avoidance’”’ of stimuli related to 
mild aggression. The sex-conflict 
group reacted similarly but less con- 
sistently, and the aggression-conflict 
group tended to “avoid” recognition
of all aggressive situations. Kleinman 
(66), measuring changes in auditory 
perceptual thresholds in cases of psy- 
chogenic deafness as compared with 
cases of organic deafness, also found 
results consistent with the hypothesis 
of perceptual defense. 

Among the most recent clinically 
oriented researches which support the 
notion of a “mechanism” of perceptual
defense are the ingenious experiments
of Blum. In the first of these
(14), the Blacky pictures were tachis- 
toscopically presented before and 
after a situation in which feelings of 
psychosexual conflict were aroused 
in the Ss. The traumatic picture was 
selected by the latter as having 
“stood out the most” significantly 
more often on the second run than on 
the first, despite the facts, first, that 
both series were flashed at subthresh- 
old speeds and, secondly, that there 
was no conscious recognition of the 
pictures on either set of trials. This is
interpreted as vigilance, at an un- 
conscious level, to cues relevant to 
the threatening impulses of sex and 
aggression. With increased exposure 
times and instructions to locate a
particular picture, an attempt was made
“to bring the ego into play.” In this
ego-involving situation, significantly 
fewer correct locations were made of 
the traumatic than the neutral pic- 
ture, thus indicating perceptual de- 
fense. 

Blum’s second experiment (15) also 
presented certain Blacky pictures 
tachistoscopically and was meticu- 
lously controlled for the variables of 
selective verbal report, familiarity, 
set, and antecedent conditions. Four 
pictures were simultaneously flashed 
on a screen at a speed too brief for
recognition. Judgments as to which 
pictures were being shown were nev- 
ertheless required. A control condi- 
tion was supplied by the fact, un- 
known to the Ss, that of the 11 
Blacky pictures familiar to them, only 
4 were presented. Responses were 
classified in relation (a) to whether 
the pictures mentioned by S were 
present or absent, and (b) to the
presence or absence in S of conflict plus
repression, as independently measured
on the Blacky dimensions. To a
highly significant degree, Ss avoided 
calling the names of pictures relating 
to their own conflicts and repressions, 
but only when these pictures were 
(subliminally) present. No such 
avoidance behavior (relative to the 
pictures which were “neutral” for 
them) was shown toward the pictures 
which were not presented. The faults 
in experimental design which Post- 
man et al. (107) showed to be present 
in early work on perceptual defense 
are avoided in this study, it is 
claimed, and furthermore, the results 
are not easy to interpret within the 
framework of the Bruner and Post- 
man hypothesis theory of perception. 
Blum’s work has been extended by
Nelson (94), with application to the 
specific personality dynamics of the 
individual. Finding that the individ- 
ual perceives in accordance with his 
areas of high and low conflict and his 
defense preferences on a variety of 
psychosexual dimensions, Nelson em- 
phasizes the value of psychoanalytic 
theory as a basis for research upon 
perceptual vigilance and defense. 


Discussion 


A proportion of the literature reviewed
above seeks to reduce the
phenomenon of defense to simpler
and more familiar principles, to
minimize its significance as a special
problem and to challenge its
interpretation in terms foreign to the
existing body of general psychological
theory. Such attempts are plausible 
with respect to some specific results, 
but are probably inadequate as a 
basis for explaining all available data. 
As pointed out by Postman, Bruner, 
and McGinnies (101), the same prob- 
lem exists in the phenomena of hys- 
terical and hypnotically induced blind- 
ness. Data of this kind cannot be 
readily explained by the kind of argu- 
ments directed at some laboratory 
studies of perceptual defense. Con- 
servatively biased criticism has in- 
deed served a useful purpose, as indi- 
cated above, but it must now be ad- 
mitted that Eriksen and Browne (49) 
have correctly summed up the position
in stating that “a firm body of
experimental support for such a
phenomenon has remained untouched by
criticism.”

Among the experimentally oriented 
clinicians who have increasingly en- 
tered this field, the opinion has grown 
that perceptual defense is not a
phenomenon pointing to some underlying
general law. Some general laws (e.g.,
sensitivity to frequently appearing
words) may influence its
manifestation under some conditions,
but for a clear demonstration of its
appearance with the factors of set,
selective report, and frequency
controlled, it is necessary to design the
experiment to take account of certain
critical individual differences.

On the other hand, perceptual
theory has the task of accounting for
perceptual phenomena, regardless of
whether individual differences are
systematically included or minimized
in the experimental design. The
search for a general explanation
appeared at a relatively early stage in
the controversy about “perceptual
defense.” Bruner and Postman (28)
were responsible for introducing the
notion that recognition need not nec- 
essarily be defined as coincident with 
veridical report and that there may 
be a hierarchy of response thresholds 
to a given stimulus. Some of these, 
such as an affective-avoidance reac- 
tion, may well be tripped off prior to 
the threshold for veridical report. 
Testing an explicit hypothesis along 
these lines, McGinnies (86) gave some 
experimental support to this kind of 
theorizing. Subsequently, McCleary 
and Lazarus (73, 82) reported evi- 
dence for a phenomenon termed 
“subception” or discriminative auto- 
nomic response to subliminal stimu- 
lation. Support for this finding has 
come from Taylor (120) and from 
Rubenfeld, Lowenfeld, and Guthrie 
(78, 113). A challenge to the
interpretation of such data as “discrimination
without awareness” has been presented
by Bricker and Chapanis (16),
Howes (60), Murdock (91), Lysak 
(80), Eriksen (47), and Voor (124). 
Such an interpretation, however, is 
not crucial to the hypothesis of 
Bruner and Postman. The affective- 
avoidance reaction could be mediated 
by a conscious, though partial,
recognition, equally as well as by an
“unconscious awareness.” In either case,
the work on so-called “subception”
is consistent with their view which 
still stands as a plausible explanation 
of much work on perceptual defense. 

An alternative view, supported by 
experimental evidence, is offered by 
Hochberg, Haber, and Ryan (59). It 
is supposed that during the interval 
between stimulation and report, a 
rapid sequence of events may occur, 
involving, for example, an autonomic 
response. This may be so strong, rela- 
tive to the tender, newborn memory 
trace, as to disrupt the latter and thus 
prevent recognition and recall of the 
briefly presented material. 
Explanations such as the two cited
above have the appeal of elegant sim- 
plicity. Both are susceptible of ap- 
plication to data concerning differ- 
ences in the type of material for which 
defense occurs and in the type of S 
who shows this behavior. These 
interpretations relate strictly to per- 
ceptual defense, however, and make 
no attempt to subsume either the 
“vigilance” or the “sensitization” 
effects. It would seem uneconomical 
to have two or three different prin- 
ciples to explain classes of perceptual 
phenomena which seem closely
related. Further research aimed at
clarifying the interrelationships of the
three types of phenomena, and at 
exploring the possibility of a unitary 
underlying principle would seem to 
be required. 


THE DEFINITION OF PERCEPTION 


Review of the present field would 
be incomplete without some reference 
to a major issue; that is, the definition 
of perception and the specification of 
the level at which need states, per- 
sonality factors, past experiences, 
and present expectations operate up- 
on the experience reported by S. The 
great majority of experiments re- 
viewed above have introduced a meas- 
ure of ambiguity into the stimulus 
situation. By brief or unclear ex- 
posure of the material, or by a delay 
between stimulus and response, maxi- 
mum play has been given to sub- 
jective factors. Can the operation of 
these be described in terms of think- 
ing, imagining, or problem solving 
rather than as part of a broadly de- 
fined perceptual process? 

Where S is provided with fuller in- 
formation, as in “conventional” psy- 
chophysics (so styled to distinguish 
it from some studies reviewed above), 
his behavior tends to support the view 
that the percept is stimulus-bound
(108). The social or clinical
psychologist, on the other hand, finds it
useful to broaden his definition. Bruner
(21), for example, justifies the experi- 
menter’s use of ambiguity by noting 
that: “. . . most complex perception,
particularly in our social lives, is de- 
pendent upon the integration of in- 
formation of a far less reliable kind 
than we normally provide in a tachis- 
toscope at rapid exposure.” 

The question involves more than 
semantic convenience. To some, it is 
a central theoretical issue. An in- 
stance is provided by Wallach’s (125) 
distinction between a sensorily de- 
termined perceptual experience and a 
recalled trace complex which gives 
identity and “total meaning” to the
experience. Acceptance of this premise
could lead to the argument that
need operates upon the recalled trace 
complex rather than upon the per- 
ceptual experience. But the validity 
of the premise is open to question, 
as shown, for example, by the experi- 
ment of Hastorf (57) who found that 
the apparent distance of an object 
depended on the degree of assumed
size attributed to it. That is to say, 
identification of the object was a 
necessary condition for the “primary”
perceptual experience of distance. 
Pratt (108) argues that Hastorf's 
results are explained by a shift in the 
judgmental frame of reference, and 
Prentice (109) questions whether the 
location of the object really looked 
different under the different experi- 
mental conditions. Such skepticism 
appears rather forced when it is 
noted that Hastorf dealt carefully 
with this point in his report (57, pp.
208-209 and pp. 212-213) and con- 
cluded on the basis of both quantita- 
tive and qualitative evidence that the 
judgments of the subjects “had very 
definite perceptual aspects and were 
not purely intellectual in nature.” 
More recently, Bruner and Minturn
(31) have pointed out that the
act of identification can modify the 
primitive perceptual organization of 
the field. Their experiment showed 
unequivocally that, for their Ss, 
closure was not a self-determining 
and “pure” perceptual or stimulus 
process, but operated differentially 
with respect to the identification 
given the object. A broken letter “B”
was recognized as a B when the sub- 
jects expected to see a letter, and was 
recognized as the figure 13 when they 
were expecting to see a number. 
Work of this kind gives us good rea- 
son to believe that “perception” on 
the one hand, and “identification,” 
or “recognition” on the other hand, 
while analytically separable under 
some conditions, cannot be distin- 
guished under others, and hence the 
distinction is theoretically untenable. 
If this be accepted, then certainly it 
cannot be agreed that need and
perception relationships are “so hard to
demonstrate” (109), and it should
probably be disputed also that such 
relationships lack importance and 
generality (123). On the other hand, 
it must be recognized that the con- 
troversy is far from its final resolu- 
tion. While further discussion would 
take us beyond the limits set for the 
present review, it might at least be 
concluded that progress is needed not 
only in need and perception research 
but also in memory and concept for- 
mation before adequate perspective 
can be reached (cf. 22, 32). 


SUMMARY AND CONCLUSIONS 


The present review was prompted 
by the divergent evaluations of sever- 
al commentators upon research in the 
field of affective processes in percep- 
tion. Four areas were selected in 
which activity has been strong for 
several years past. Studies of size 
judgment provide a category defined
by the general nature of the depend- 
ent variable. Design and technique 
in such experiments has progressive- 
ly improved. Several findings indi- 
cate that certain motivational states 
are determinants of size judgment, 
and this is true of some recent work 
as well as of the earlier studies. A 
satisfactory theory to account for the 
correlation is lacking. Physiological 
need in relation to perception is the 
second area considered. From the 
several studies reviewed it is con- 
cluded that the weight of evidence 
favors the hypothesis that need is a 
determinant of perception, but that 
only a beginning has been made in the 
search for reasons to explain this 
relationship. A third area comprises 
a group of studies on “selective
sensitization” for stimuli representing
positive values. Such a phenomenon
appears to be well established, and
not wholly due to artifacts of
experimental procedure. Adequate
specification of the conditions under which
sensitization occurs waits upon future 
activity. The final, and most prolific, 
area dealt with concerns reactions to 
stimuli presumed to be noxious or 
threatening to Ss. Here are reviewed 
a considerable number of studies 
attempting to demonstrate or deny 
the phenomenon known as “perceptual
defense.” Challenging findings
are presented by clinically oriented 
studies involving individual differ- 
ences and also by experiments re- 
lated to theory of a more generalized 
type. The problem of discovering the 
mediating mechanisms responsible 
for the perceptual-motivational cor- 
relation has stimulated some useful 
research, as witnessed, for example, 
by the work on “subception,” and has
also provoked some thinking which 
will direct future investigations. A 
final section is included which deals 
with some contrasting ways of
defining the term perception.

It can be concluded that studies on 
perception in relation to various 
affective processes have been amply 
successful in the raising of important
problems and the setting of useful
directions for future work. Integration
of the complex and oft-seeming
contradictory body of data is much
needed, but some progress is being
made in this direction. Future work
requires renewed sifting of the evi- 
dence, replication of studies, par- 
ticularly those with conflicting results, 
continued improvement in method- 
ology, a sense of direction towards 
theoretical objectives, recognition of 
work in related areas, and at least 
some coordination of effort between 
the widely ramifying branches of 
the field. 


ADDITIONAL “POST-MORTEM” TESTS OF
EXPERIMENTAL COMPARISONS


JULIAN C. STANLEY
University of Wisconsin¹


McHugh and Ellis (3) furnish an
illustration of Scheffé's (5) method
for judging all contrasts in the
analysis of variance. It seems desirable
to emphasize here that Scheffé's
procedure, being quite general, is more
conservative than necessary for most
comparisons of interest to
psychologists; that is, it then results in too
few rejections of the null hypothesis
and too wide confidence intervals
around the obtained difference
between means. Frequently, alternative
tests suggested by Tukey² and
Dunnett (1) will be more powerful.

Scheffé offers a method for testing
among means all possible contrasts
that were not incorporated explicitly
into the experimental design,
provided only that the sum of the
weights assigned to the various
means is zero³ (for planned contrasts,
use Student's $t$). Suppose there are
4 means, as in the McHugh-Ellis
analysis. We may arbitrarily take the
sum of 10 times the first mean, 3
times the second mean, and 6 times
the third, so long as we subtract from
this sum $(10 + 3 + 6) = 19$ times the
fourth mean. The variance of this


¹ Postdoctoral fellow in statistics, University
of Chicago, 1955-56. Thanks are due
James C. Reed for helpful suggestions.

² McHugh and Ellis mention Tukey’s test
in a footnote but do not illustrate it, nor do
they emphasize the overgenerality of Scheffé's
procedure for most psychological experiments.
In a long, mimeographed, undated manuscript
entitled “The problem of multiple comparisons”
Tukey gives the rationale for his
method, which Scheffé (5) discusses.

³ Even when the over-all F is not significant,
one may ascertain precise confidence limits
by the Scheffé, Tukey, or Dunnett procedures.




difference among these independent
means will be $10^2$ times the variance
of the first mean plus $3^2$ times the
variance of the second plus $6^2$ times
the variance of the third plus $19^2$
times the variance of the fourth.
Since the random-sampling variance
of a mean is $\sigma^2/n$ and we use the mean
square for error, $s^2$, as an estimate
of the population variance for each
group, the variance of the weighted
composite above will be
$s^2(100/n_1 + 9/n_2 + 36/n_3 + 361/n_4)$.
In the McHugh-Ellis article, $s^2 = 31.5$ and each
$n_i$ is 12, so the variance of the
composite is

$$\frac{31.5(100 + 9 + 36 + 361)}{12} = 1328.25.$$


The square root of this, 36.45, is
to be compared in some manner with
the net difference among the weighted
means, $10(105.61) + 3(112.27) + 6(103.93) - 19(114.05) = -150.46$.
The ratio of 150.46 to 36.45 is 4.13,
which we might naively (and
incorrectly) compare with a $t$ with 32 df
at, say, the $\alpha = .01$ level of significance
for a two-tailed test, which is 2.74.
The appropriate comparison is with

$$\sqrt{(k-1)\,F_{\alpha;\,k-1,\,df}} = \sqrt{3(4.46)} = 3.66,$$

$k = 4$ being the total number of means
that might be compared. Since 4.13
exceeds 3.66, the difference is
significant beyond the .01 level. The
incorrect .99 confidence interval is
$-150.46 \pm 2.74(36.45) = -250.33$ to $-50.59$,
while the correct interval is much
wider: $-150.46 \pm 3.66(36.45) = -283.87$ to $-17.05$.
In this experiment one may, knowing 3.66, set up
a confidence interval for any
comparison,

$$\sum_{i=1}^{k} c_i \bar{X}_i,$$

the $c$s being fixed weights ($-\infty < c_i < \infty$)
such that

$$\sum_{i=1}^{k} c_i = 0.$$

Often, perhaps usually, we make
a posteriori contrasts of just two
means, $\bar{X}_i$ and $\bar{X}_j$, weighted equally;
let $c_i = 1$, $c_j = -1$, and all other $c$s
$= 0$. For this particular type of
comparison, and if $n_i = n_j$, Tukey's
method is preferable to Scheffé’s. As
an illustration, consider .99
confidence intervals for
$\bar{X}_2 - \bar{X}_1 = 112.27 - 105.61 = 6.66$
from the McHugh-Ellis data. The simple $t$-test limits
are too narrow:

$$6.66 \pm 2.74\sqrt{31.5}\sqrt{1/12 + 1/12} = 6.66 \pm 6.27.$$

Scheffé’s are too broad:

$$6.66 \pm 3.66\sqrt{31.5}\sqrt{1/12 + 1/12} = 6.66 \pm 8.38.$$

Tukey’s are intermediate between the
above two:

$$6.66 \pm q\sqrt{31.5}\sqrt{1/12} = 6.66 \pm 7.74,$$

where $q = 4.775$ is the upper 1% point
of Studentized range (4, p. 177) for a
sample of 4 means with 32 df for $s^2$.
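
The three half-widths can be compared side by side in a small sketch. The helper names are hypothetical, and the critical values (2.74, 4.46, 4.775) are the tabled figures quoted in the text, supplied as inputs rather than looked up. Unrounded arithmetic gives results that may differ from the printed limits by about .01.

```python
import math

def half_width_t(t_crit, mse, n_i, n_j):
    """Half-width from the simple t test for a two-mean contrast."""
    return t_crit * math.sqrt(mse * (1.0 / n_i + 1.0 / n_j))

def half_width_scheffe(k, f_crit, mse, n_i, n_j):
    """Half-width from Scheffe's method for the same contrast."""
    return math.sqrt((k - 1) * f_crit) * math.sqrt(mse * (1.0 / n_i + 1.0 / n_j))

def half_width_tukey(q_crit, mse, n):
    """Half-width from Tukey's method, assuming equal n per group."""
    return q_crit * math.sqrt(mse / n)

mse, n, k = 31.5, 12, 4
hw_t = half_width_t(2.74, mse, n, n)
hw_scheffe = half_width_scheffe(k, 4.46, mse, n, n)
hw_tukey = half_width_tukey(4.775, mse, n)
for label, hw in [("t", hw_t), ("Tukey", hw_tukey), ("Scheffe", hw_scheffe)]:
    print(label, round(hw, 2))
```

The ordering t < Tukey < Scheffé reproduces the point of the illustration: for a simple two-mean contrast with equal group sizes, Tukey's interval is narrower than Scheffé's.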

Thus when interested only in
contrasts of the form $(\bar{X}_i - \bar{X}_j)$, use
Tukey's method if the $n_i$s are equal.
For differing $n$s and/or more
complicated contrasts, use Scheffé's
procedure, which is completely general
within the restriction that the sum
of the weights applied to the $k$ means
must be zero. All three methods
(simple $t$ test, Tukey's, and Scheffé's)
give identical results when $k = 2$.

Where a “standard” treatment has 
been incorporated into the experi- 
mental design, as for example when 
one control group and three differ- 
ent experimental groups are used, we 
can narrow the confidence interval 
even more by employing Dunnett's 
(1) tables.⁴ Suppose that in the above
example Method 2 was the predesignated
standard with which each of
the three other methods was to be
compared. The appropriate .99
confidence interval for the contrast with
Method 1 would be
$6.66 \pm 3.15\sqrt{31.5}\sqrt{2/12} = 6.66 \pm 7.21$,
where 3.15 is the
figure obtained from Dunnett's Table
26 (1, p. 1120) for 3 “treatment”
means and 32 df. This interval will
typically be narrower than limits
secured via Tukey's procedure, but
wider than those from the simple $t$
test, unless $k = 2$, when all are
identical.⁵

The Tukey and Dunnett methods 
both require that $n_1 = n_2 = \cdots = n_k$.
Otherwise, probability values for the 
confidence intervals are approximate. 


CONCLUDING REMARKS 


The techniques for obtaining
confidence intervals described above
apply especially to Model I (fixed-effect)
designs where levels of the
factor tested are unordered. For the
fixed effects in “mixed” models, see
Scheffé (6, p. 33; 7, pp. 261-263).
Kurtz (2) outlines a procedure to
use when error variances are unequal.


⁴ He also furnishes tables for one-tailed
comparisons, to be used when an explicit
hypothesis specifying the direction of deviation
of the treatment mean from the standard
was stated before measures were secured.

⁵ By D. B. Duncan's multiple range test
(Multiple range and multiple F tests, Biometrics,
1955, 11, 1-42), four of the six possible
simple comparisons among the four means are
significant at the .01 level. $M_4$ (114.05) does
not differ significantly from $M_2$ (112.27), nor
does $M_1$ (105.61) from $M_3$ (103.93).


Situations frequently occur in
which an investigator obtains the 
correlations between the same two 
variables in several groups of subjects 
and wants to obtain some over-all 
estimate of the degree of correlation 
between the two variables. For 
example, one might find the correla- 
tions between scores on a measure of 
Mechanical Information (Me) and 
scores on a measure of Mathematics 
Background (MB) for students in 
several trade and technical schools 
and want to know a general value 
expressing the degree of relationship 
between these two variables. 

With respect to the variables being 
correlated various situations might 
exist. Several examples are shown in 
Fig. 1. A set of data such as that in 
Fig. 1A might occur if students in 
some schools were extensively trained 
in mathematics but had little shop 
experience, while students in other 
schools received the opposite pattern 
of training. Note that within the 
schools (A, B, C, D) the correlations 
are all positive and about equal in 
magnitude, but, due to the negative 
correlation between means, a coeffi- 
cient obtained by plotting the data 
from all schools on a single scatter 
diagram would be negative (the 
dotted ellipse). Figure 1B represents 
a situation with a perfect positive 
correlation between means. The coef- 
ficient of correlation obtained in any 
school, such as G, might be corrected 
for restriction in range, in this case, 
to give the value which would be
obtained if all data were plotted on a
single scatter diagram. In Fig. 1C the 
correlation between means is about 
equal to the correlation within the 
groups, and the dotted ellipse has 
about the same shape as the solid-line 
ellipses. 














FIG. 1. ILLUSTRATION OF THE INFLUENCE
OF CORRELATIONS BETWEEN SUBGROUP
MEANS ON TOTAL-GROUP CORRELATION.


If an investigator is in a situation 
in which he desires to know what cor- 
relation he would obtain if he mingled 
the data from all groups on a single 
scatter diagram, and if he knows the 
sample sizes ($n_j$), the correlations
($r_{X_jY_j}$), the means, and the standard
deviations ($\sigma_{X_j}$ and $\sigma_{Y_j}$) for all the
groups ($1, \ldots, j, \ldots, m$), he can
compute such a coefficient by means
of Formula 1, presented in a slightly
different form by Dunlap (1).

$$r = \frac{\sum_{j=1}^{m} n_j \sigma_{X_j} \sigma_{Y_j} r_{X_jY_j} + \sum_{j=1}^{m} n_j \delta_j \Delta_j}{\sqrt{\sum_{j=1}^{m} n_j(\sigma^2_{X_j} + \delta_j^2)}\;\sqrt{\sum_{j=1}^{m} n_j(\sigma^2_{Y_j} + \Delta_j^2)}} \qquad [1]$$

where $\delta_j$ is the difference between the
mean of the X values for group $j$ and
the mean of X for all cases, and $\Delta_j$ is
the difference between the mean of
the Y values for group $j$ and the
mean of Y for all the cases. (In this
paper it is assumed that all $\sigma$s are
computed using $n_j$ in the denominator.
If $n_j - 1$ is used, this value should
replace $n_j$ in each formula.)
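
A brief sketch may help in checking Formula 1 against a direct computation. The data below are invented for illustration (two hypothetical schools with positive within-school correlations but negatively related means, as in Fig. 1A), and the function names are not from the original paper. Standard deviations use $n_j$ in the denominator, as the text assumes.

```python
import math

def group_stats(xs, ys):
    """Return (n, mean_x, mean_y, sd_x, sd_y, r) with n in the denominators."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return n, mx, my, math.sqrt(sxx / n), math.sqrt(syy / n), sxy / math.sqrt(sxx * syy)

def total_group_r(stats):
    """Formula 1: total-group correlation from per-group summary statistics."""
    total_n = sum(s[0] for s in stats)
    grand_x = sum(s[0] * s[1] for s in stats) / total_n
    grand_y = sum(s[0] * s[2] for s in stats) / total_n
    num = den_x = den_y = 0.0
    for n, mx, my, sx, sy, r in stats:
        dx, dy = mx - grand_x, my - grand_y  # deviations of group means
        num += n * sx * sy * r + n * dx * dy
        den_x += n * (sx ** 2 + dx ** 2)
        den_y += n * (sy ** 2 + dy ** 2)
    return num / math.sqrt(den_x * den_y)

# Hypothetical scores for two schools (X, Y).
groups = [([1, 2, 3, 4], [11, 13, 12, 14]),
          ([6, 7, 8, 9], [3, 2, 5, 4])]
stats = [group_stats(xs, ys) for xs, ys in groups]
pooled_r = total_group_r(stats)

# Direct check: correlate all cases mingled on one scatter diagram.
all_x = [x for xs, _ in groups for x in xs]
all_y = [y for _, ys in groups for y in ys]
direct_r = group_stats(all_x, all_y)[5]
print(round(pooled_r, 4), round(direct_r, 4))
```

In this made-up case both within-school correlations are positive while the mingled coefficient is negative, which is the situation the dotted ellipse in Fig. 1A depicts.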

The above procedure for obtaining 
an over-all estimate of the degree of 
correlation between the two variables 
might be appropriate if the investiga- 
tor was in a practical situation in 
which he could not use knowledge of 
the group from which each subject
comes. However, in theoretical
studies an investigator is often
interested in ascertaining the degree of
relationship between the variables if
all groups of subjects had the same
opportunity or training. Then it is
more useful to think in terms of the
correlation when the groups have
been placed on the scatter diagram so
that the means all coincide. To
obtain a value for the correlation under
this condition, if the sampling of
individuals has been random and
independent from populations that are
the same with regard to correlation,
one may compute the weighted
average correlation coefficient directly or
by means of Fisher's $z$ transformation.
However, in the usual case
where one has used intact groups of
subjects the independent random
sampling is of groups rather than of
individuals, and here Lindquist (3,
pp. 219-221) recommends that one
compute a within-groups correlation
by use of analysis of covariance. If
the groups do not differ widely in
correlation, Lindquist indicates that
the resulting coefficient may be
treated as though it had been
obtained from a simple random sample
of

$$\sum_{j=1}^{m} n_j - m + 1$$

individuals, where $m$ is the number of
groups. The degrees of freedom
would be

$$\sum_{j=1}^{m} n_j - m.$$

Although Lindquist discusses the 
computation of the within-groups 
correlation (3, pp. 219 ff.), he leaves 
the reader who is not well acquainted 
with analysis of variance and co- 
variance procedures to piece to- 
gether from scattered parts of his 
book a computational procedure. 
McNemar (4, p. 321, p. 327) men- 
tions the within-groups coefficient, 
but he does not discuss its use. It is 
mentioned in other scattered refer- 
ences. The present note brings to- 
gether the discussion of its use and 
two approaches for computing the 
within-groups correlation. Which
computational approach is to be used 
depends on the investigator's other 
manipulations of the data. 

If the investigator is not particu- 
larly interested in examining the indi- 
vidual correlations within each group 
of subjects, and is not interested in 
the means and standard deviations of 
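the variables for his groups, it is
efficient to use the raw-score formula, 2,
where the individuals in group $j$ are
numbered $1, \ldots, i, \ldots, k$, and
$k = n_j$.

$$r_w = \frac{\sum_{j=1}^{m}\sum_{i=1}^{k} X_{ij}Y_{ij} - \sum_{j=1}^{m} \frac{\left(\sum_{i=1}^{k} X_{ij}\right)\left(\sum_{i=1}^{k} Y_{ij}\right)}{k}}{\sqrt{\sum_{j=1}^{m}\sum_{i=1}^{k} X_{ij}^2 - \sum_{j=1}^{m} \frac{\left(\sum_{i=1}^{k} X_{ij}\right)^2}{k}}\;\sqrt{\sum_{j=1}^{m}\sum_{i=1}^{k} Y_{ij}^2 - \sum_{j=1}^{m} \frac{\left(\sum_{i=1}^{k} Y_{ij}\right)^2}{k}}} \qquad [2]$$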
In the situation in which the
investigator has already computed the
correlations and standard deviations
within the individual groups of
subjects and also wants an over-all
evaluation of the degree of
relationship between the variables, he may
compute the within-groups
correlation by means of an equivalent
formula, 3, suggested to the writer
by S. S. Wilks.

$$r_w = \frac{\sum_{j=1}^{m} n_j r_{X_jY_j} \sigma_{X_j} \sigma_{Y_j}}{\sqrt{\sum_{j=1}^{m} n_j \sigma^2_{X_j}}\;\sqrt{\sum_{j=1}^{m} n_j \sigma^2_{Y_j}}} \qquad [3]$$

Here $r_{X_jY_j}$ is the correlation between
X and Y in group $j$, $\sigma_{X_j}$ is the
standard deviation of variable X in group
$j$, and $\sigma_{Y_j}$ is the standard deviation of
variable Y in group $j$.

Before correlations are averaged
(directly or by means of the $z$
transformation) they may be corrected for
attenuation due to unreliability. In 
the case of within-groups correlations 
Wilks has shown the writer that the 
correction for attenuation may be in- 
corporated directly into the computa- 
tions. For various groups of subjects 
reliability estimates of a measure are 
likely to differ due to different stand- 
ard deviations for the measure in 
different groups. In Formulas 4 and 
5, corresponding to 2 and 3, respec- 
tively, but introducing the correction 
for attenuation, rx,x, and ry,y, are 
reliability coefficients for variables X 
and Y, computed separately for each 
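group of subjects. The correction for
attenuation is introduced by
substituting Formula 3 into the standard
formula for correction for attenuation.

$$r_w = \frac{\sum_{j=1}^{m}\sum_{i=1}^{k} X_{ij}Y_{ij} - \sum_{j=1}^{m} \frac{\left(\sum_{i=1}^{k} X_{ij}\right)\left(\sum_{i=1}^{k} Y_{ij}\right)}{k}}{\sqrt{\sum_{j=1}^{m} r_{X_jX_j}\left[\sum_{i=1}^{k} X_{ij}^2 - \frac{\left(\sum_{i=1}^{k} X_{ij}\right)^2}{k}\right]}\;\sqrt{\sum_{j=1}^{m} r_{Y_jY_j}\left[\sum_{i=1}^{k} Y_{ij}^2 - \frac{\left(\sum_{i=1}^{k} Y_{ij}\right)^2}{k}\right]}} \qquad [4]$$

$$r_w = \frac{\sum_{j=1}^{m} n_j r_{X_jY_j} \sigma_{X_j} \sigma_{Y_j}}{\sqrt{\sum_{j=1}^{m} n_j r_{X_jX_j} \sigma^2_{X_j}}\;\sqrt{\sum_{j=1}^{m} n_j r_{Y_jY_j} \sigma^2_{Y_j}}} \qquad [5]$$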
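
The equivalence of the raw-score and summary-statistic routes, and the effect of building the attenuation correction into the denominator, can be illustrated with a minimal sketch. The scores and the reliability coefficients below are invented, the function names are hypothetical, and the corrected form assumes the reliabilities enter as multipliers of the within-group variances in the denominator.

```python
import math

def corrected_sums(xs, ys):
    """Within-group sums of squares and cross products from raw scores."""
    k = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / k
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / k
    syy = sum(y * y for y in ys) - sum(ys) ** 2 / k
    return k, sxy, sxx, syy

def within_r_raw(groups):
    """Raw-score route: pool cross products and sums of squares over groups."""
    num = den_x = den_y = 0.0
    for xs, ys in groups:
        _, sxy, sxx, syy = corrected_sums(xs, ys)
        num, den_x, den_y = num + sxy, den_x + sxx, den_y + syy
    return num / math.sqrt(den_x * den_y)

def within_r_summary(stats, rel_x=None, rel_y=None):
    """Summary route from (n, sd_x, sd_y, r) per group.

    With reliabilities supplied, each group's variance is weighted by its
    reliability coefficient, which corrects for attenuation.
    """
    rel_x = rel_x or [1.0] * len(stats)
    rel_y = rel_y or [1.0] * len(stats)
    num = sum(n * r * sx * sy for n, sx, sy, r in stats)
    den_x = sum(n * rx * sx ** 2 for (n, sx, _, _), rx in zip(stats, rel_x))
    den_y = sum(n * ry * sy ** 2 for (n, _, sy, _), ry in zip(stats, rel_y))
    return num / math.sqrt(den_x * den_y)

# Hypothetical scores for two groups (X, Y).
groups = [([1, 2, 3, 4], [11, 13, 12, 14]),
          ([6, 7, 8, 9], [3, 2, 5, 4])]
stats = []
for xs, ys in groups:
    k, sxy, sxx, syy = corrected_sums(xs, ys)
    stats.append((k, math.sqrt(sxx / k), math.sqrt(syy / k), sxy / math.sqrt(sxx * syy)))

r_raw = within_r_raw(groups)
r_summary = within_r_summary(stats)
r_corrected = within_r_summary(stats, rel_x=[0.9, 0.8], rel_y=[0.85, 0.9])
print(round(r_raw, 4), round(r_summary, 4), round(r_corrected, 4))
```

The first two values agree, which is the sense in which the two computational approaches are equivalent; the corrected value is larger because reliabilities below unity shrink the denominator.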


The assumptions of analysis of co- 
variance are discussed by Jackson (2) 
who provides illustrations of tests of 
these assumptions. 


A GENERAL METHOD OF ANALYSIS OF FREQUENCY
DATA FOR MULTIPLE CLASSIFICATION DESIGNS


Fisher (3) has shown the advantages
of the factorial experiment over
the classical method of “one vari- 
able.” The following gains accrue 
from consideration of the effects of 
independent variables (treatments) 
upon the dependent variable in the 
context of other independent vari- 
ables: (a) With a sample of size n, 
and k treatment classifications which 
do not interact, “hidden replication” 
enables estimation of all k main ef- 
fects with the same precision as 
would be achieved for one in a single 
factor experiment of the same size. 
The economy of the factorial design 
is indicated by the fact that to obtain 
the same amount of information by 
the “rule of one variable,” one would
need k sets of n replicates. (b) If
there is interaction among the
treatments, the factorial arrangement
enables its isolation and evaluation and
thereby sets the limits of
generalization. One can specify the effect of
the independent upon the dependent 
variable in a variety of contexts; and 
conversely, if interaction is zero one 
may conclude that the relationship is 
constant through all contexts con- 
sidered. (c) A further virtue of the 
factorial design lies in the informa- 
tion it may provide about the rela- 
tive efficacy of different combina- 
tions of conditions for the production 
of given effects. Most use has been 
made of this in agriculture and indus- 
try, but it has its scientific as well as 
its technological applications, such 
as in sorting out necessary and
sufficient conditions.

In practice the factorial design has 
most often been used where it is
possible to obtain measurements on the
dependent variable, so that statisti- 
cal analysis of the outcomes is by way 
of the “analysis of variance.” In 
many research areas, however, phe- 
nomena are not yet amenable to scal- 
ing so that one has counts or fre- 
quencies within given categories ra- 
ther than measures, e.g., male versus 
female rather than degrees of sexu- 
ality. There is no logical hindrance 
to the use of factorial experimenta- 
tion with these phenomena, and such 
is to be recommended in light of the 
advantages to be gained. The prob- 
lem is to find a method of statistical 
analysis appropriate to this type of 
data. χ² methods are available for
the comparison of sampled frequen- 
cies and for assessing association in 
simple contingency tables. These 
cases are in effect instances of single 
and double classification designs, and 
if contingency association is the 
analogue of interaction in analysis of 
variance, then a method of assessing 
multiple contingency is needed for 
the analysis of frequency data from 
higher order designs. Pearson (6) 
described a procedure for assessing 
multiple contingency but failed to 
consider the question of additivity 
of χ² components. Bartlett (1) offered
a method for the 2×2×2 case
which involves the solution of a cubic 
equation and is difficult to apply in 
practice. Recently, Lancaster (5),
following proofs by Irwin (2) and
Lancaster (4), has devised a general
method of partitioning a total χ² and
degrees of freedom into independent
additive components due to given
sources of variation. This completes
the parallel with the analysis of variance
in which the total sum of squares
and degrees of freedom are parti- 
tioned into sums of squares and df 
for all main effects, interactions and 
error. This paper presents for psy- 
chologists a general form of multiple 
contingency analysis based on Lan- 
caster’s work, provides an illustra- 
tion, and comments on the generality 
of application of the method. 


MULTIPLE CONTINGENCY 
ANALYSIS 


Complex contingency tables of 
frequency data from multiple classi- 
fication designs may be of several 
forms according as the sampling of 
main effects is random or restricted. 
(a) The random case imposes no 
sampling restrictions. For example, 
after a random sample has been 
taken, it may be classified in various 
ways and the frequencies within 
classes will be due only (within 
sampling limits) to the population 
proportions. (b) The mixed case in- 
volves restrictions upon the propor- 
tions within categories of given classi- 
fications and freedom with respect to 
others, e.g., arranging in advance that 
a total sample will involve equal 
proportions of the sexes. Parameters 
for a classification are defined by its 
restriction. Whichever case is in- 
volved, for each observed frequency in 
the table there will be an expected 
value, and hence divergence of the 
total table from expectation may be 
tested through χ². Within a total
table, however, there will be a number
of sources of variation comprising
main effects and interactions, and to
isolate them one would need to partition
the total χ² and df into independent
additive components due
to such sources. The problem is
to specify the expected frequencies
which will meet this requirement.
The method will be developed
through a notation which, while
perfectly general, will for simplicity
be set out for an A×B×C design.

Let the classifications be symbolized
as A, B, C, ..., L. Let A be
subdivided into a categories represented
generally by the subscript i,
which thus takes values from 1 to a.
Similarly B is represented by
(j = 1, ..., b), C by (k = 1, ..., c),
etc. Let $p_{ijk}$ = the probability of an
observation falling in the ijkth cell;
$o_{ijk}$ = the observed frequency in the
ijkth cell; and $e_{ijk}$ = the expected
frequency in the ijkth cell. Let a dot
in place of a subscript represent summation
across the values represented
by the subscript, e.g.,

\[ \sum_{i=1}^{a} o_{ijk} = o_{.jk}. \]

Let the total sample size $o_{...} = N$; and
finally $p_{...} = 1.0$. On the hypothesis¹
of zero interaction,
$p_{ijk} = p_{i..} \times p_{.j.} \times p_{..k}$,
$p_{ij.} = p_{i..} \times p_{.j.}$, etc. These
parameters are used to find the expected
frequencies, e.g., $e_{ijk} = p_{ijk} \times N$,
and hence $\chi^2$ may be calculated
as $\sum (o - e)^2 / e$. Now some or all of
the values of the parameters $p_{i..}$,
$p_{.j.}$, $p_{..k}$ may be (a) known from the
population; or (b) estimated from
the sample, e.g., $p_{i..} = o_{i..}/N$. These
situations taken with the random
and mixed designs provide four cases
each requiring separate consideration.
Case (1a) will be presented in full
for the A×B×C design.
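The notation above can be made concrete with a small computation. The following is a minimal sketch, assuming a hypothetical 2×2×2 table of observed counts and assumed known population proportions of one-half in every category; the function names and the data are illustrative only and come from no study cited here.

```python
from itertools import product


def expected_frequencies(p_a, p_b, p_c, n):
    """Return e_ijk = p_i.. * p_.j. * p_..k * N for every cell (zero interaction)."""
    cells = product(range(len(p_a)), range(len(p_b)), range(len(p_c)))
    return {(i, j, k): p_a[i] * p_b[j] * p_c[k] * n for i, j, k in cells}


def total_chi_square(observed, expected):
    """Sum (o - e)^2 / e over all cells of the table."""
    return sum((observed[c] - expected[c]) ** 2 / expected[c] for c in expected)


# Hypothetical observed counts for a 2 x 2 x 2 table, keyed by (i, j, k).
observed = {
    (0, 0, 0): 12, (0, 0, 1): 8,
    (0, 1, 0): 10, (0, 1, 1): 10,
    (1, 0, 0): 9, (1, 0, 1): 11,
    (1, 1, 0): 14, (1, 1, 1): 6,
}
n = sum(observed.values())

# Assumed known population proportions: every category equally likely.
expected = expected_frequencies([0.5, 0.5], [0.5, 0.5], [0.5, 0.5], n)
chi2_total = total_chi_square(observed, expected)
df_total = 2 * 2 * 2 - 1  # abc - 1

print(f"N = {n}, total chi-square = {chi2_total:.2f} on {df_total} df")
```

With other proportions only the three lists passed to `expected_frequencies` change; each list is assumed to sum to 1.0.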


(1a) Random sampling, known parameters


The partition of total χ² and df
into component values for this case,
together with computing formulas,
are set out in Table 1. As the population
values are known, the significance
of all main effects and interactions
may be assessed.


¹ Ordinarily one works with the null hypothesis,
but population hypotheses of nonzero
interactions may be entertained, e.g., in a
test of goodness of fit with case 1a, and again
in determining the power of the test of significance
for a given situation.


TABLE 1
PARTITION OF χ² AND DF FOR CASE (1a)

Number  Source  χ²                                                                                                  df
1       A       $\chi^2_A = \sum_i (o_{i..} - e_{i..})^2 / e_{i..}$                                                 (a−1)
2       B       $\chi^2_B = \sum_j (o_{.j.} - e_{.j.})^2 / e_{.j.}$                                                 (b−1)
3       C       $\chi^2_C = \sum_k (o_{..k} - e_{..k})^2 / e_{..k}$                                                 (c−1)
4       AB      $\chi^2_{AB} = \sum_i \sum_j (o_{ij.} - e_{ij.})^2 / e_{ij.} - (1+2)$                               (a−1)(b−1)
5       AC      $\chi^2_{AC} = \sum_i \sum_k (o_{i.k} - e_{i.k})^2 / e_{i.k} - (1+3)$                               (a−1)(c−1)
6       BC      $\chi^2_{BC} = \sum_j \sum_k (o_{.jk} - e_{.jk})^2 / e_{.jk} - (2+3)$                               (b−1)(c−1)
7       ABC     $\chi^2_{ABC} = \sum_i \sum_j \sum_k (o_{ijk} - e_{ijk})^2 / e_{ijk} - (1+2+3+4+5+6)$               (a−1)(b−1)(c−1)
        Total   $\chi^2_T = \sum_i \sum_j \sum_k (o_{ijk} - e_{ijk})^2 / e_{ijk}$                                   (abc−1)

Note. A parenthesized sum such as (1+2) denotes the sum of the χ² values in the rows so numbered.


(1b) Random sampling, parameters
estimated from the data


In this case one estimates population
proportions from the sample
data, e.g., $p_{i..} = o_{i..}/N$, and as
$e_{i..} = p_{i..} \times N$, then $e_{i..} = o_{i..}$. Accordingly
for this case the values of χ² and
df for the main effects are zero. One
may assess all interactions and their
df are unchanged, but the total df is
reduced by the number lost with the
main effects.


(2a) Mixed case, known parameters

Here restriction specifies the parameters
for a classification, so that
the main effects and df for the restricted
classifications are zero. Furthermore,
if several classifications
and their subclasses are restricted,
within that set the interactions and
df are also zero. For instance, if in
an A×B×C design the proportions
within the A and B classes and AB
classes are prearranged and sampling
is random only with respect to C,
then one has set $e_{i..} = o_{i..}$, $e_{.j.} = o_{.j.}$,
$e_{ij.} = o_{ij.} = (o_{i..} \times o_{.j.})/N$. In this
case one would obtain

\[ \chi^2_T = \chi^2_C + \chi^2_{AC} + \chi^2_{BC} + \chi^2_{ABC}, \]

and the total df would be reduced by
the number lost with the main effects
and interactions.


(2b) Mixed case, parameters estimated 
from the data 


Here one loses all the main effects 
and such interactions as involve only 
the restricted classifications. For the 
case with A and B restricted and C 
random, 


\[ \chi^2_T = \chi^2_{AC} + \chi^2_{BC} + \chi^2_{ABC}, \]


and as before the total df has to be 
adjusted for the number lost with the 
main effects and interactions. 
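The subtraction scheme of Table 1 and the loss of the main effects in case (1b) can be illustrated numerically. What follows is a sketch under stated assumptions rather than a reproduction of Lancaster's procedure: the 2×2×2 counts are hypothetical, the helper names are invented for the illustration, and each interaction component is obtained, as in Table 1, by subtracting the lower-order components from the χ² of the corresponding marginal table.

```python
def chi_square(obs, exp):
    """Sum (o - e)^2 / e over matching keys of two tables."""
    return sum((obs[key] - exp[key]) ** 2 / exp[key] for key in obs)


def margin(table, axes):
    """Collapse a table keyed by (i, j, k) onto the listed axes by summation."""
    out = {}
    for cell, value in table.items():
        key = tuple(cell[axis] for axis in axes)
        out[key] = out.get(key, 0.0) + value
    return out


def sample_proportions(observed, axis):
    """Estimate marginal proportions from the sample, e.g. p_i.. = o_i.. / N."""
    n = sum(observed.values())
    totals = margin(observed, (axis,))
    return [totals[(level,)] / n for level in range(len(totals))]


def partition(observed, p_a, p_b, p_c):
    """Partition the total chi-square of an A x B x C table as in Table 1."""
    n = sum(observed.values())
    expected = {(i, j, k): p_a[i] * p_b[j] * p_c[k] * n for (i, j, k) in observed}

    def marginal(axes):
        return chi_square(margin(observed, axes), margin(expected, axes))

    a, b, c = marginal((0,)), marginal((1,)), marginal((2,))
    ab = marginal((0, 1)) - (a + b)
    ac = marginal((0, 2)) - (a + c)
    bc = marginal((1, 2)) - (b + c)
    total = marginal((0, 1, 2))
    abc = total - (a + b + c + ab + ac + bc)
    return {"A": a, "B": b, "C": c, "AB": ab, "AC": ac, "BC": bc,
            "ABC": abc, "T": total}


# Hypothetical counts keyed by (i, j, k); N = 80.
observed = {
    (0, 0, 0): 15, (0, 0, 1): 9, (0, 1, 0): 12, (0, 1, 1): 10,
    (1, 0, 0): 7, (1, 0, 1): 11, (1, 1, 0): 13, (1, 1, 1): 3,
}

# Case (1a): proportions assumed known (one-half in every category).
known = partition(observed, [0.5, 0.5], [0.5, 0.5], [0.5, 0.5])

# Case (1b): proportions estimated from the sample margins.
estimated = partition(
    observed,
    sample_proportions(observed, 0),
    sample_proportions(observed, 1),
    sample_proportions(observed, 2),
)

for source in ("A", "B", "C", "AB", "AC", "BC", "ABC", "T"):
    print(f"{source:>3}  known: {known[source]:7.3f}  estimated: {estimated[source]:7.3f}")
```

In the estimated column the three main-effect components vanish (up to rounding), which is the numerical counterpart of the statement that their χ² and df are zero in case (1b).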


ILLUSTRATION OF THE METHOD 


An experiment is reported (7) in
which the manner of resolving conflict
(A, i = 1, 2) is observed under
four conditions constituting the factorial
arrangement of two conditions
of social distance (B, j = 1, 2) and
two conditions of publicity
(C, k = 1, 2). Four independent random
samples of 100 cases were assigned to
the four conditions, and the whole
experiment was replicated for eighteen
conditions of social sanction
(D, l = 1, 2, ..., 18). In this way
equal numbers were subjected to all
treatments and the only main effect
frequencies free to vary were those
pertaining to type of conflict resolution.
That is, A is random and B,
C, and D and their subclasses are restricted
in an A×B×C×D design.
As the population proportions for
type of conflict resolution were unknown,
they were estimated from
the sample data. Hence the analysis
follows the (2b) type, where

\[ \chi^2_T = \chi^2_{AB} + \chi^2_{AC} + \chi^2_{AD} + \chi^2_{ABC} + \chi^2_{ACD} + \chi^2_{ABD} + \chi^2_{ABCD}, \]
\[ df_T = (bcd - 1)(a - 1). \]
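As a check on the degrees of freedom, the total may be compared with the sum over the interactions that involve A, the only ones remaining in a (2b) analysis with B, C, and D restricted. The arithmetic below is a worked sketch; it assumes, as in Table 1, that the df of an interaction is the product of (number of categories − 1) over the classifications involved, with a = b = c = 2 and d = 18.

```latex
\begin{aligned}
df_{AB} = df_{AC} = df_{ABC} &= 1, \\
df_{AD} = df_{ABD} = df_{ACD} = df_{ABCD} &= 17, \\
df_T = 3(1) + 4(17) &= 71 = (bcd - 1)(a - 1) = (72 - 1)(1).
\end{aligned}
```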


GENERALITY OF APPLICATION 


As presented the method has ap- 
plication to factorial experiments in 
which information on the dependent 
variable is in frequency form. Equally
the method may be applied to surveys
where sampling units are classified
in a variety of ways. A further
application of this type of method to
measurement data has recently been
suggested by Wilson² (8), who uses
it as a “distribution-free” substitute
for the analysis of variance. He does 
not justify this substitution and some 
comment is warranted. Wilson di- 
chotomizes his measures at the 
median and, in effect, introduces the 
dependent variable as an additional 
classification with two levels. In- 
formation is lost in categorizing and 
to that extent a test of significance 
with frequencies is less sensitive 
than one applied to measures. Hence 
one would only use with measure- 
ment data multiple contingency 
analysis as a substitute for analysis of 
variance when the latter method was 
not applicable. This would be so 
when certain assumptions required 
for a valid F test could not be met— 
normality of parent population, ho- 
mogeneity of variance—and a suit- 
able transformation was not avail- 
able. Here in the absence of the more 
sensitive test, the less sensitive test 
would certainly be preferable to none 
at all. 


² Wilson's procedures are based upon a
particular hypothesis about the expected
values, viz., irrespective of treatment effects
a score on the dependent variable is equally
likely to occur above or below the median.
It should be noted that this is not the only
population hypothesis which may be entertained.
The method presented in this paper,
being more general than Wilson's, is to be
preferred on that score.


With the recent increase in interest
in Q technique, stimulated primarily
by the work of Cattell (1)
and Stephenson (3), researchers in
the clinical, personality, and social
areas have frequently needed to compute
Pearsonian product-moment coefficients
of correlation from Q arrays
in large numbers. Many of them
have simply followed the standard 
textbook computing methods which, 
because of the special conditions in- 
volved in correlating Q arrays, are 
exceedingly inefficient. The purpose 
of this note is to present a quick 
method for computing such corre- 
lations. 

The special conditions referred to
above are these: All the Q arrays to
be intercorrelated have, because of
the instructions for Q sorting, exactly
the same distribution, and therefore
identical means and standard deviations.
The most efficient formula
for computing rs under these circumstances
is derived from the formula
for the product-moment r by the
“method of differences” (2, p. 118,
formula 186):

\[ r = \frac{\sigma_x^2 + \sigma_y^2 - \sigma_D^2}{2\sigma_x\sigma_y}, \tag{1} \]

where D represents the difference between
paired scores.

Since the two distribution standard
deviations are equal, this formula
reduces to

\[ r = 1 - \frac{\sigma_D^2}{2\sigma^2}. \tag{2} \]


¹ From the Psychology Service, Franklin
D. Roosevelt Veterans Administration Hospital,
Montrose, New York.


The $\sigma_D^2$ term can be readily expressed
in raw-score terms as follows:

\[ \sigma_D^2 = \frac{\sum D^2}{N} - M_D^2, \tag{3} \]

where $M_D$ represents the mean of
the paired differences.

Since the mean of the paired differences
must equal the difference
between the means, which is zero,

\[ \sigma_D^2 = \frac{\sum D^2}{N}. \tag{4} \]

Substituting in Equation 2,

\[ r = 1 - \frac{\sum D^2}{2N\sigma^2}. \tag{5} \]

In any given Q-technique research,
the denominator of the fraction is a
constant, K, for all the correlations
to be performed, since both N, the
number of statements, and σ², the
variance of the forced frequency distribution
of scale values, are constant.
Equation 5 therefore becomes

\[ r = 1 - \frac{\sum D^2}{K}. \tag{6} \]

For any given correlation, the
ΣD² is readily found and substituted
for the arithmetic computation
of r.
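The shortcut can be checked against the ordinary product-moment computation. The sketch below is illustrative only: the nine-statement forced distribution and the two sorts are hypothetical, the function names are assumptions of the example, and K is taken as 2Nσ² with σ² the variance of the forced distribution computed with N in the denominator.

```python
import math


def population_variance(scores):
    """Variance with N (not N - 1) in the denominator."""
    n = len(scores)
    mean = sum(scores) / n
    return sum((s - mean) ** 2 for s in scores) / n


def q_constant(n_statements, variance):
    """K = 2 * N * sigma^2, constant for every pair of sorts in one study."""
    return 2 * n_statements * variance


def q_correlation(x, y):
    """r = 1 - sum(D^2) / K for two sorts sharing one forced distribution."""
    if sorted(x) != sorted(y):
        raise ValueError("both sorts must use the same forced distribution")
    k = q_constant(len(x), population_variance(x))
    sum_d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return 1 - sum_d2 / k


def pearson_r(x, y):
    """Textbook product-moment correlation, for comparison."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical sorts of nine statements into the distribution 1, 2, 2, 3, 3, 3, 4, 4, 5.
sort_a = [1, 2, 2, 3, 3, 3, 4, 4, 5]
sort_b = [3, 1, 4, 2, 5, 3, 2, 4, 3]

print(q_correlation(sort_a, sort_b), pearson_r(sort_a, sort_b))
```

The two printed values agree to rounding error, and only the sum of squared differences has to be recomputed for each new pair of sorts.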

Since Equation 6 is a linear equation,
however, whenever the number
of correlations to be found becomes
large, it is a simple matter to construct
a nomograph and read off the
values of r (see Fig. 1). The nomograph
is constructed as follows:

1. On a sheet of graph paper oriented
so that the long sides are at
left and right, lay off along the left
side values of ΣD² from zero at the
bottom to K at the top, and on the
right values for ΣD² from K at the
bottom to 2K at the top.

2. On the bottom horizontal lay
off values of r from +1.00 at the
left to .00 at the right. On the top
horizontal lay off values of r from
.00 at the left to −1.00 at the right.

3. Draw a straight line from the
lower left corner (r = 1.00, ΣD² = 0)
to the upper right corner (r = −1.00,
ΣD² = 2K).

In using the nomograph, when
ΣD² is entered from the left, r is
read off at the bottom; when ΣD² is
entered from the right, r is read off
at the top.

Numerical illustration. A set of 100
statements is Q sorted into a forced
distribution with a σ² on the Q scale
of 4.0. K therefore equals 2(100)
(4.0) = 800 and Equation 6 becomes

\[ r = 1 - \frac{\sum D^2}{800}. \]

For illustrative purposes, the
nomograph of this equation is given
in Fig. 1.














FIG. 1. EXAMPLE OF NOMOGRAPH WHEN
K = 800.


A number of techniques have been
published (e.g., 1, 2, 3, 4, 5, 6) to
assist in the evaluation of the significance
of differences in proportions, or
frequencies, arranged in 2×2 contingency
tables. All of the available
methods suffer from limitations either
in generality of problems to which
they may be applied, in the arithmetic
computations involved in their
use, or in the appropriate nature of
the obtained probability levels.

The present method was designed
to overcome two of these limitations.
The arithmetic operations have been
reduced to simple addition and subtraction
of the cell entries and the obtained
significance levels are based
on the chi-square distribution. The
one limitation of the technique is the
requirement that for maximum usefulness
the two samples being compared
must be independently selected
and contain the same number of subjects.
Nevertheless, it is still possible
in certain cases where the requirement
of equal sample frequencies is
not met to obtain a conservative
probability statement.

In developing the present technique,
use was made of certain relationships
inherent in the computing
formula for chi square for 2×2 contingency
tables. Consider the following
table.

                     Sample 1    Sample 2
Classification X     a (21)      b (9)       a+b = (30)
Classification Y     c           d           c+d = (60)
                                             a+b+c+d = N = (90)


where a, b, c, and d are cell frequencies.

If the two column totals, a+c and
b+d, are equal and if a+b is chosen
to be equal to or less than one-half
the total number of subjects (½N),¹
and further if we set

a+b = S and a−b = D,

it can be shown that

\[ \chi^2 = \frac{D^2 N}{S(N - S)}. \tag{1} \]

Solving this for N gives

\[ N = \frac{S^2 \chi^2}{\chi^2 S - D^2}. \tag{2} \]


¹ The restriction that the smaller of the two
row totals be used is not necessary for the development
of Formula 1. However, in order
to increase the legibility of the charts, the
requirement was imposed.


For a fixed value of chi square, representing
any desired level of significance,
Equation 2 may be solved for
all values of N corresponding to varying
values of S and D which meet the
requirement that S ≤ ½N. If D is
also fixed, and only S allowed to vary, 
a series of Ns may be computed 
which, together with their corre- 
sponding Ss, determine the coordi- 
nates of a curve for each fixed D for 
the selected probability level. 
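The coordinates of one such curve can be computed directly from Equation 2. The following is a minimal sketch and not the procedure by which the published charts were prepared; the function names and the trial values of S are assumptions of the example, and 3.841 is the usual critical value of chi square at the .05 level with one degree of freedom.

```python
def chi_square_from(s, d, n):
    """Formula 1: chi-square for equal samples, given S = a+b, D = a-b, and N."""
    return d ** 2 * n / (s * (n - s))


def required_n(s, d, chi2_crit):
    """Equation 2: the N at which S and D yield exactly chi2_crit, or None."""
    denominator = chi2_crit * s - d ** 2
    if denominator <= 0:
        return None  # D is too large for this S: no finite N satisfies Equation 2
    return chi2_crit * s ** 2 / denominator


def curve(d, chi2_crit, s_values):
    """(S, N) points of the curve for a fixed D, keeping only S <= N/2."""
    points = []
    for s in s_values:
        n = required_n(s, d, chi2_crit)
        if n is not None and s <= n / 2:
            points.append((s, n))
    return points


# Illustrative curve for D = 8 at the .05 level.
for s, n in curve(8, 3.841, range(10, 60, 10)):
    print(f"S = {s:3d}  N = {n:7.2f}")
```

Points for which Equation 2 has no positive solution, or for which S would exceed one-half of N, are simply omitted from the curve.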

In Figures 1, 2, 3, and 4 are plotted
the N, S curves for several fixed Ds
for the critical values of χ² (df = 1) for
levels of significance .001, .01, .05,
and .10, respectively. In computing
the curves, a maximum N of 200 (100
subjects per sample) was arbitrarily
selected as the upper limit. The
lower limit of 40 (20 subjects per sample)
was chosen as a prudent minimum
sample size. The lower limit
of the S values was selected as 10 so
that, under the null hypothesis, the
value in cells a and b would be at
least five.²

It should also be noted that the D
value assigned to each curve in the
charts is one unit larger than that
used in the computation of that
curve. For example, using a chi-square
value of 2.706 (p = .10) the
curve obtained for a D value of 7
would be labeled on the chart as 8;
that for a D value of 8 labeled as 9;
and so on. This is necessary since the
region between any two curves (referring
now to the D values actually
used in the calculations), for example
D = 7 and D = 8, represents values
greater than 7 but less than 8. Since
in practice only integral D values
may be obtained, all the fractional
values greater than 7 are theoretical
only. Actually a difference of 8 must
be obtained before significance can
be reached. Consequently, the D = 7
curve must be labeled D = 8.


² The chi-square values used were: 10.827
for the .001 level; 6.635 for the .01 level; 3.841
for the .05 level; and 2.706 for the .10 level.
Computations were done twice, independently,
and then spot checked a third time.
During the plotting of each curve any erratic
fluctuations were noted and the computations
rechecked. It is felt that the curves are accurate
within the limits of the numerical
values used and the usual limitations of plotting.
To use the charts, it is first necessary
to determine which pair of cell
entries, a+b or c+d, is the smaller.
Choose S = smaller (a+b or c+d).
With this sum (S) on the abscissa and
the total number of cases in both
samples (N) on the ordinate enter the
chart appropriate for the desired
level of significance. The difference
curve (D) immediately to the left of
the intersection of the S and N
values is the minimum difference between
the two selected cell frequencies
required for statistical significance.

As an example, consider the numerical
values given in parentheses
in the preceding 2×2 table. The two
row totals are 30 and 60. The chart
is entered with S = 30 (since 30 < ½N
= 45) and with N = 90. To determine
if the difference of D = 21 − 9 = 12 is
larger than what one would expect by
chance at, say, the .01 level, enter
Fig. 2 and note the D curve immediately
to the left of the intersection of
S and N. For this example, it is observed
that any D ≥ 12 will result in
statistical significance for S = 30 and
N = 90. Since our observed D = 12,
we may conclude that the difference
is significant at the .01 level.

Yates’ correction may be applied 
by subtracting one from the obtained 
difference between the selected pair 
of cell values. In the example just 
given, the obtained difference of 12 
would be reduced to 11 and the dif- 
ference would no longer be significant 
at the .01 level. 
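The chart reading in this example can be verified from Formula 1. The sketch below is an arithmetic check only; the function names are assumptions of the example, and 6.635 is the critical value of chi square at the .01 level with one degree of freedom.

```python
import math


def chi_square_equal_samples(s, d, n):
    """Formula 1: chi-square = D^2 * N / (S * (N - S)) for two equal samples."""
    return d ** 2 * n / (s * (n - s))


def minimum_significant_d(s, n, chi2_crit):
    """Smallest whole-number D whose chi-square reaches chi2_crit."""
    return math.ceil(math.sqrt(chi2_crit * s * (n - s) / n))


S, N, D = 30, 90, 12          # values from the example: a = 21, b = 9
CRITICAL_01 = 6.635           # chi-square at the .01 level, df = 1

uncorrected = chi_square_equal_samples(S, D, N)
corrected = chi_square_equal_samples(S, D - 1, N)   # Yates: reduce D by one

print(f"minimum D for significance: {minimum_significant_d(S, N, CRITICAL_01)}")
print(f"uncorrected chi-square: {uncorrected:.2f}, significant: {uncorrected >= CRITICAL_01}")
print(f"with Yates' correction: {corrected:.2f}, significant: {corrected >= CRITICAL_01}")
```

The uncorrected value exceeds 6.635 while the corrected value falls short of it, in agreement with the two conclusions drawn from the chart.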


DAVID K. 


TRITES 


When the frequencies in the two 
samples are unequal, a test of signifi- 
cance may be obtained by reducing 
the larger sample to the size of the 
smaller, keeping the original propor- 
tions in the cells unchanged. The S 
and D values are then determined in 
the usual manner and the appropriate 
chart entered with a total N equal to 
twice the frequency of the smaller
sample. If the graph indicates sig- 
nificance for this comparison, the 
probability of the difference in the 
original data must be even smaller. 
Of course, if after reduction of the 
larger sample the comparison is not 
significant, no statement can be made 
regarding the significance of the difference
in the unreduced data.


Psychologists, as careful scientists,
have in their research and teaching
been greatly concerned with the
problems of methods and experimental
controls. One of the common
methods used in animal experimentation
is the litter-mate control or split-litter
technique. Munn in his discussion
of general procedures in research
with the rat presents the rationale
underlying the use of this technique
when he states, “In many experiments
it is necessary to use two or
more groups of comparable genetic
constitution, a control group and one
or more experimental groups. The
closest one can come to achieving
comparability is to have the different
groups consist of litter mates”
(8, p. 7).

Munn refers to a paper by Corey
(2) which was concerned with the
problem of the initial equating of
control and experimental groups. In
1930 Corey pointed out the “... misplaced
confidence in the uniformity
of the subjects that is supposed to be
gained through the use of inbred
stock” (2, p. 287). His evidence was
gained on 160 rats from a colony bred
for about one year from six pairs of
Wistar rats. Coefficients of correlation
for learning performance between
litter halves were: trials, .78;
active time, .72; errors, .80; and total
time, .30. It is clear from these data
that there is a pronounced correlation
between litter halves indicating
significant litter differences. Further
there is the implication that differences
in initial (genetic) ability might
be the basis for these litter differences.
Corey suggested as an experimental
design the selection of animals
to represent a normal distribution
of the ability of all Ss. He admits
that this is difficult and suggests
the usefulness of the split-litter technique,
pointing out the obvious value
of this technique in serving to hold
constant certain environmental factors.
Corey, however, does not specifically
deal with the problem of
genetic control by means of the split-litter
method.


¹ This paper was prepared at the Division of
Behavior Studies, R. B. Jackson Memorial
Laboratory, Bar Harbor, Maine, during the
summer of 1955. V.H.D. was a Carnegie Fellow,
S.R. and B.E.G. are Scientific Associates
of the Laboratory.

Experimental evidence such as we 
have just cited, in addition to the 
considered opinions of many psy- 
chologists, tends to lead to the ac- 
ceptance and wide use of a given tech- 
nique. Quite often certain funda- 
mental assumptions are accepted 
without critical appraisal. We are 
concerned here with a tacit assump- 
tion which we believe has been ac- 
cepted by many psychologists. The 
assumption is that when the split-litter
technique is used, all genetic
factors pertinent to the variable
under investigation are held constant.
Thus, any differences obtained are
strictly a function of environmental
factors, presumably the treatment
effects introduced into the experimental
situation. The assumption is
expressed or implied in the following
statements made by highly competent
research workers. “But with
such small groups, and especially
without litter-mate controls, this
conclusion is entirely gratuitous”
(8, p. 347). “The only control over
possible genetic differences was the
division of each litter into solitary
animals and normal animals” (1,
p. 73). “... When heredity is held
constant between experimental
groups, by the split-litter method ...”
(5, p. $33).

The geneticist, however, looks at 
this problem differently than does the 
psychologist. Scott (10) and more 
recently Ginsburg (3) have pointed 
to the importance of the genetic char- 
acteristics of the animal Ss used by 
psychologists. 

Scott discusses the relationships 
between genetic variables and be- 
havioral variability. In regard to the 
use of litter-mate controls he says,
“... litter mates show on the average
a correlation of .5 in hereditary
variables. However, since an animal
gets half of its variable heredity from
each parent it is theoretically possible
in small samples to get litter mates
which are entirely different.
Litter-mate controls should be considered as
a means of controlling age and environment,
although it has been
shown that since the animals affect
each other, and since the eggs may be
lodged in different parts of the uterus,
the environment is far from identical
for each animal” (10, p. 529).
Ginsburg in discussing the problem 
of the genetic variables which are 
frequently ignored in psychological 
investigations says, “In a small or 
moderate sized animal colony that 
has not been rigorously inbred or sub- 
jected to the strictest selection with 
respect to the trait in question, the 
use of litter-mates does not consti- 
tute an adequate genetic control. If 
there is appreciable heterozygosity 
for genetic factors affecting the ex- 
perimental outcome, these will segre- 
gate among litter-mates, making 
them genetically unlike each other.
Under the conditions just discussed,
it would be preferable to select ex- 
perimental and control samples at 
random from the colony, than to use 
the split-litter technique where, what- 
ever the number of litters or animals 
involved in the latter procedure, they 
trace back to a limited number of rel- 
atively recent matings within the 
colony” (3, p. 41). 

We believe that Ginsburg’s defini- 
tion of the type of animal colony for 
which the split-litter technique is not 
an adequate genetic control is de- 
scriptive of many of the animal 
colonies which are used by psycholo- 
gists as sources of experimental Ss 
(probably including Corey's Ss). 
First, the colonies are small to mod- 
erate in size (N less than 500). Sec- 
ond, the strains used are rarely pure 
strains. In fact most departments 
which maintain rat colonies for ex- 
perimental purposes have only par- 
tially inbred lines. These animals 
may differ markedly one from the 
other, even though started from a 
(roughly) genetically similar pair of 
rats. Since these are partially inbred 
strains, and strains in which the selec- 
tion for the traits under experimental 
investigation is relatively unknown, 
there may be segregation of genes 
which can affect the outcome of the 
experiment. 

A consideration of genetic theory 
will easily demonstrate why this is so. 
In order to achieve complete genetic 
uniformity, we must produce indi- 
viduals whose genotypes are identi- 
cal. In so far as they are different, 
we may expect to get variability which 
is compounded due to the recombina- 
tion of genes at each mating, much 
as cards are recombined after being 
shuffled and dealt anew. The pheno- 
typic value of a given gene depends, 
in part, upon the genes with which it 
interacts, just as the value to the card 
player of the queen of hearts depends
in part on the other cards he holds
in his hand. The magnitude of phen- 
otypic variability, which is the factor 
of greatest interest to the psychologi- 
cal researcher, is a function of the 
possible combinations of genes rather 
than of their actual number. In- 
breeding reduces the field of genetic 
variability but may actually increase 
phenotypic variability in some cases 
(9). These include situations where 
a given genotype that may become 
fixed as a result of inbreeding is more 
susceptible to environmental influ- 
ences than most other genotypes, and 
cases of multiple factors affecting a 
quantitatively varying character. In 
the latter case, if homozygosity repre- 
sents the extreme phenotypic types, 
partial inbreeding will increase the 
standard deviation of the population 
with respect to the character as the 
homozygous types increase at the ex- 
pense of the heterozygotes. Calcula- 
tions of the degree of genetic rela- 
tionship probably achieved in appli- 
cation by a given system of inbreed- 
ing over a known number of genera- 
tions may be easily made (11, 12). 
While such calculations of genetic 
relationship will yield valuable in- 
formation regarding the expected de- 
crease in heterozygosis under a given 
system of inbreeding, this informa- 
tion is with respect to either the 
average heterozygosity for all genes 
in a given animal or the average con- 
dition for a pair of alleles in the popu- 
lation as a whole. This is not particu- 
larly useful when we are concerned 
with the possible effects of one or a 
few specific loci in an experimental 
situation where all of the hazards of 
sampling error apply. 

The process of mating close rela- 
tives, therefore, is, by itself, no 
guarantee of actual genetic uni- 
formity. Selection, conscious or un- 
scious, may favor heterozygosity for 
factors affecting fertility, viability,
etc. Thus a number of loci may be
prevented from stabilizing. There
is good evidence that sublines of a
single inbred line separated after
original inbreeding of more than 20
to 30 generations of brother × sister
matings have, after their separation,
shown evidence of genetic differences.
The basis of these differences is either
different segregation of residual heterozygosity
in the common parents of
both sublines or newly arising mutations.
It is difficult to establish which
process occurred in any particular
case.²

The inbred animals, to be sure, are 
alike for a great many genes, but 
the genes for which they remain vari- 
able would have to be ruled out as af- 
fecting the experimental situation be- 
fore the use of the split-litter method 
would constitute an adequate control 
for genetic factors. A test of this is 
to obtain the parent-offspring correla- 
tion for a number of litters on the 
trait which is of experimental inter- 
est. If the correlation is high, two 
things are indicated. First, genetic 
factors and/or some environmental 
factors (which can be experimentally 
eliminated) are affecting the char- 
acteristic being measured. Second, 
these factors are relatively constant 
within a litter. In this situation the 
split-litter method is of value. If the 
parent-offspring correlation is low, 
this might be due to any of several 
reasons, some of which are: (a) ge- 
netic factors do not affect the trait 
being measured, (b) all genetic factors
influencing this trait are identical for 
all litters, (c) recessive genes are seg- 
regating, or (d) mutations possibly 
are occurring. Regardless of the basis 
of the low correlation, splitting litters 
in this situation is of no value and 
may even be detrimental to the 
efficiency of the experiment. 
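The check described above amounts to a single correlation across litters. The sketch below is one possible way of computing it and is offered under stated assumptions: the scores are hypothetical, each litter is represented by a mid-parent score and by the mean of its offspring (a choice made for the example, not one specified above), and no cutoff between a "high" and a "low" correlation is implied.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: (mid-parent score, scores of the offspring in that litter).
litters = [
    (10.0, [9.0, 11.0, 10.0, 12.0]),
    (14.0, [13.0, 15.0, 14.0]),
    (8.0, [9.0, 7.0, 8.0, 8.0, 10.0]),
    (12.0, [11.0, 13.0, 12.0, 14.0]),
    (16.0, [15.0, 14.0, 17.0]),
]

parent_scores = [parent for parent, _ in litters]
litter_means = [sum(offspring) / len(offspring) for _, offspring in litters]

r = pearson_r(parent_scores, litter_means)
print(f"parent-offspring correlation over {len(litters)} litters: r = {r:.3f}")
```

With real data the same computation would be run on the trait of experimental interest before deciding whether splitting litters is worth the loss of flexibility in assigning animals.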

² Dr. Elizabeth S. Russell, personal communication,
January 12, 1956.

Even highly inbred strains may
not remain uniform forever. Thus, 
the DBA/1 Jax mice, which two of us 
have used as Ss, and which consti- 
tute as pure an inbred population as 
one is likely to get, is composed of a 
number of sublines which have 
changed from each other in time 
after having been inbred to the point 
of practical genetic identity. This 
was made evident when the popula- 
tion from which we had drawn our 
Ss was decimated by the fire at the 
R. B. Jackson Memorial Laboratory 
in 1947. DBA/1 mice, descended
some generations back from the same 
ancestors as the ones we had been 
using, were sent in to the Laboratory 
to replace the ones destroyed. Some 
of these animals reacted like the pre- 
fire mice, but others did not in a situ- 
ation where genetic variables were of 
primary importance (3). The con- 
tinuing genetic identity of an inbred 
strain can, therefore, not be taken for 
granted. Responsible geneticists and
laboratories engaged in the production
of inbred strains must and do
continually check their materials to
make sure that the strains remain
constant.

We do not mean to imply by this 
that the method of litter-mate con- 
trols should not be used. What we do 
wish to state is that this procedure in 
no way guarantees that the genetic 
factors, which are liable to influence 
the results of the study, are neces- 
sarily held constant when either 
partially inbred strains or randomly 
bred animals are used. 

Perhaps this point can be further 
clarified by examining the extreme 
conditions. If isogenic strains (all Ss 
in the colony have identical genes) 
are used, splitting litters cannot con- 
trol for hereditary variables since 
these are already constant. Splitting 
the litter will in part control for certain
environmental factors which are
constant within a litter, but differ 
from litter to litter (e.g., maternal 
influence and age). At the other end 
of the continuum, one could use off- 
spring from mongrel animals pur- 
chased from pet shops throughout the 
country and randomly mated. One 
would not expect the offspring of the 
animals within a litter to be geneti- 
cally more similar in regard to the 
behavior being investigated than off- 
spring from different litters, since all 
the parents have a random assort- 
ment of genes. Here again the split- 
litter method would equate for cer- 
tain environmental factors. The Ss 
generally used in psychological re- 
search are somewhere between these 
two extremes, possibly nearer the iso- 
genic situation. Thus, we would ex- 
pect that when partially inbred ani- 
mals are used, splitting a litter will 
equate for some of the hereditary 
factors influencing the behavior being 
studied, but not all of these factors. 
Genes not yet stabilized within a
colony will be segregating within
litters and producing genetically unlike
litter mates. Here again the
splitting will control for certain
environmental factors, but these only
so far as they are constant within a
given litter.

From the statistical point of view
the above discussion would seem to
indicate that the split-litter method
can be of value when used with the
appropriate analytical procedures.
That is, the split-litter method may
be considered to be a stratified random
sample in which the litters are
the strata and the Ss within the litters
are the sampling units. Thus, if
one used a "matched groups" design
and removed "litters" as a source of
variance, this would appear to remove
some (though not all) of the
genetic variability influencing the
behavior being measured. The error
term would be reduced and greater 
precision to the test of significance of 
the independent variable would be 
achieved. This argument is not nec- 
essarily true. Hansen, Hurwitz, and 
Madow (4, ch. 5) in their discussion 
of stratified sampling point out that 
biased estimates and loss of precision 
can occur by sampling from very 
small strata (litters) and by the use 
of relatively few cases within each 
stratum. Under these conditions 
random sampling is preferable. 

A comparison of random and strat- 
ified sampling can be made by consid- 
ering a single locus at which a dom- 
inant gene and (in the simplest case) 
one recessive allele are involved and 
in which the occurrence of hetero- 
zygosity will affect the experimental 
results. Since, in most cases, we are 
unable to detect the heterozygous in- 
dividuals in any obvious way, our 
problem becomes that of achieving a 
nearly equal distribution of heterozy- 
gotes between our experimental and 
control groups. To this end we have
a choice between sampling the animal
colony at random and equating the
experimental and control Ss on the
basis of age, sex, weight, etc., or
equating the groups by the split-litter
technique. Essentially this is a
choice of taking either the individual 
or the mating as our unit of sampling. 
In the former case, the probability of 
including some heterozygotes in the 
experimental or control group is a 
function of the frequency with which 
such individuals occur in the colony, 
and of the sampling error. In the 
latter case, the probability is a func- 
tion of the frequency with which mat- 
ings capable of producing such indi- 
viduals occur in the colony, and 
again, of the sampling error. On the
simplified assumptions of random 
mating in the absence of mutation 
and selection affecting the trait in
question, and on the further assump- 
tion that the genes under considera- 
tion are equally distributed in the 
two sexes, the relative frequency 
with which a particular genotype
may be expected to occur can be 
calculated either from a considera- 
tion of the types of matings possible 
and the relative frequency of each, 
or of the gametes available in the 
colony and the frequency of the geno- 
types to be expected as a result of 
their random combinations. 

Using the first method, there are 
six possible matings to consider: 


1. AA × AA
2. AA × Aa
3. AA × aa
4. Aa × aa
5. Aa × Aa
6. aa × aa


Of these, No. 6 may be eliminated, 
since it can be identified by the phen- 
otype and does not contribute to the 
heterozygosity in any case. Nos. 3 
and 4 may also be excluded, even 
though they do contribute to the 
heterozygosity of the succeeding gen- 
eration because they, too, are identi- 
fiable phenotypically. No. 5 may be 
identified through the occurrence of 
recessive progeny, thus leaving 1 and 
2 as the source of our population. On 
the assumptions enumerated, if X 
represents the proportion of AA in- 
dividuals in the parental generation, 
and Y represents the proportion of 
Aa individuals, then the frequency 
with which type 2 matings (producing
heterozygotes) occur may be derived
from the expression $(X\,AA + Y\,Aa)^2$
and is equal to $2XY$. Only
half the progeny of such matings, or
$XY$, will consist of heterozygous
individuals. It is thus evident that
whatever the absolute numbers may 
be in a given case, when the individual 
is taken as the sampling unit, heterozygosis
will occur only half as frequently
as when the mating is taken
as the unit. Given the usual circumstances
of relatively small numbers
of Ss in each of the groups, the sam- 
pling errors become appreciable and 
the most reasonable supposition is 
that heterozygosis encountered in a 
litter that is split between experi- 
mentals and controls will not be 
equally distributed. 
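
The first method above can be illustrated with a short calculation. This is only a sketch under the simplifying assumptions already stated (random mating among the AA and Aa parents, with X + Y = 1); the function names and the example proportions are hypothetical and do not come from the original discussion. The expansion also yields a Y² term for Aa × Aa matings, which the text sets aside because such matings can be identified through their recessive progeny.

```python
from fractions import Fraction

def mating_frequencies(x, y):
    """Expand (X AA + Y Aa)^2 into the frequency of each mating type.

    x -- proportion of AA parents, y -- proportion of Aa parents (x + y == 1).
    """
    return {
        "AA x AA": x * x,
        "AA x Aa": 2 * x * y,   # type 2 matings, the source of heterozygotes
        "Aa x Aa": y * y,       # identifiable through recessive progeny
    }

def heterozygous_progeny_from_type2(x, y):
    """Only half the progeny of an AA x Aa mating are heterozygous (Aa)."""
    return mating_frequencies(x, y)["AA x Aa"] / 2

# Hypothetical colony: three quarters AA, one quarter Aa.
x, y = Fraction(3, 4), Fraction(1, 4)
freqs = mating_frequencies(x, y)
hetero = heterozygous_progeny_from_type2(x, y)
```

With these illustrative proportions, type 2 matings occur with frequency 2XY and heterozygous individuals from them with frequency XY, so the individual-based figure is half the mating-based figure, as the paragraph states.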

Using the second method and the 
same assumptions, the relative fre- 
quency with which a particular geno- 
type may be expected to occur can 
be obtained by expanding the binomial
$[q + (1 - q)]^2$, where $q$ equals
the frequency of gene A and $(1 - q)$
equals the frequency of gene a (6,
ch. 1; 7, ch. 6). These methods may
be extended to additional independ- 
ent pairs of alleles by making the cal- 
culation for each pair separately and 
combining these through the use of 
the product law. In the case of a
multiple allelic series, the gene frequency
notation is extended by using
a polynomial corresponding to the
number and frequency of the alleles
involved, rather than a binomial, as
in the simpler model cited here.
Where the genes are not equally distributed
between the sexes, separate
polynomials representing the
distribution for each sex must be
used.

In addition to the statistical and
genetic arguments there are several
experimental points to consider as
well. Though certain of the environmental
factors within a litter are
constant for all Ss, there are other
factors which are variable and which
make for dissimilarity among litter
mates. Some of these factors are
litter size, competition for food, patterns
of dominance and aggression,
and differences in uterine environment.
If these variable factors are
ones which will influence the dependent
variable, then splitting litters in
no way equates for them. Indeed, 
this procedure is likely to lead to 
larger intralitter differences than 
interlitter differences. Thus the pre- 
cision of the test of significance may 
be reduced as compared to a purely 
random design. 

In conclusion, then, we feel that 
the split-litter procedure, when used 
with the realization that some genetic 
factors and some environmental fac- 
tors are probably controlled while 
others are not, can be an efficient de- 
sign. However, the experimenter 
should be aware of the shortcomings 
of this method and should not apply 
it blindly to all problems. In coming 
to a decision as to the advisability of 
using the split-litter procedure, the 
research worker has to decide whether 
the control gained over some (gen- 
erally unknown) genetic factors and 
some constant environmental factors 
present within the litter more than 
compensate for the additional vari- 
ability introduced by the probable 
segregation of recessive and infre- 
quent dominant genes and for the in- 
fluence of certain variable factors 
present within the litter environ- 
ment. In addition he should also be 
aware of the possibility of obtaining 
biased estimates and loss of precision 
in his statistical analysis. 

If the experimenter is critically 
concerned with controlling genetic 
variables, then we suggest either us- 
ing isogenic strains of Ss or instituting 
a breeding program to determine the 
genetic bases for the behaviors which 
he is studying. In any case, the split- 
litter technique does not guarantee 
an identical distribution of genetic 
sources of variance to the various 
treatment groups in the frequent in- 
stances of partially inbred lines of 
rats, dogs, or other experimental ani- 
mals from small breeding colonies. 

In studying communication and
interaction among individuals, it frequently
happens that the investigator
observes some quantity associable
with a pair of persons, but not observable
in connection with either
individual considered separately. Examples
are attitude agreement between
the two persons, amount of
communication between them, belonging
to the same family, or any
relation which may be treated as
symmetrical. When such binary relations
are studied within groups of
persons, it may occur that a person
who is a member of one pair from
which a measure is taken will also be
a member of another pair from which
a measure is taken. The experimenter
then finds himself with measures
which are not experimentally independent
because the same individual
contributed to both measures. This
paper offers a method, given a collection
of scores obtained by observing
pairs of persons, of constructing
scores in which the contributions of
individuals are held constant so that
the variability among the resulting
"interaction scores" may be attributed
to the conditions (e.g., communication
or agreement) specifying
the obtained pair-scores and not to
characteristics associated with the
persons individually. In regard to
establishing a score for an observed
unit such that the score is not biased
by the fact that different units overlap
the same individuals, this method
may be considered a contribution to
the same problem area treated in
recent papers by Luce, Macy, and
Tagiuri (2), by Tagiuri, Bruner, and
Kogan (3), and by Winer (4).


THE LINEAR HYPOTHESIS 


The present method of constructing
"interaction scores" from a collection
of scores obtained from the observation
of pairs of individuals rests
on the hypothesis (cf. 1, Ch. 5, 6) that
the obtained pair-score $y_{ij}$ consists of
a linear combination of

$\mu$ = the population mean pair-score over all pairs,
$b_i$ = the deviation from the mean due to person $i$,
$b_j$ = the deviation from the mean due to person $j$, and
$(b)_{ij}$ = the deviation from $\mu + b_i + b_j$ due to the interaction of persons $i$ and $j$. It is the estimation of this last component in which we are interested.

Accordingly, we begin by assuming
that

$$y_{ij} = \mu + b_i + b_j + (b)_{ij}, \qquad [1]$$

and summing the pair-scores over the
persons $j$, we have for any person $i$,

$$y_{i.} = \mu + b_i + b_{.} + (b)_{i.}. \qquad [2]$$

Now, since the persons $i$ and the
persons $j$ are the same population,
the sums of deviations
from the mean are:

$$\sum_i b_i = \sum_j b_j = 0, \quad \text{and} \quad \sum_i (b)_{ij} = \sum_j (b)_{ij} = \sum_{i,j} (b)_{ij} = 0.$$

Then, summing the pair-scores containing
person $i$, we have from
Equation 2:

$$\sum_j y_{ij} = (N-1)\mu + (N-1)b_i + 0 + 0$$

where $N$ is the number of persons,
and $N-1$ the number of pairs which
include person $i$. Or,

$$\frac{\sum_j y_{ij}}{N-1} = \mu + b_i.$$

And for all pairs, similarly, from
Equation 1:

$$\frac{\sum_{i,j} y_{ij}}{N(N-1)/2} = \mu.$$

And for any sample, respectively, we
have the expectations

$$\frac{\sum_j y_{ij}}{N-1} = \hat{\mu} + \hat{b}_i, \quad \text{and} \qquad [3]$$

$$\frac{\sum_{i,j} y_{ij}}{N(N-1)/2} = \hat{\mu}. \qquad [4]$$

Thus from Equations 3 and 4, the
estimated deviation score of person $i$
is

$$\hat{b}_i = \frac{\sum_j y_{ij}}{N-1} - \frac{\sum_{i,j} y_{ij}}{N(N-1)/2}.$$

Now from Equation 1 we have

$$(b)_{ij} = y_{ij} - \mu - b_i - b_j,$$

and substituting the estimates we
have

$$(\hat{b})_{ij} = y_{ij} - \hat{\mu} - \hat{b}_i - \hat{b}_j,$$

which with the appropriate substitutions
becomes

$$(\hat{b})_{ij} = y_{ij} + \frac{\sum_{i,j} y_{ij}}{N(N-1)/2} - \frac{\sum_j y_{ij}}{N-1} - \frac{\sum_i y_{ij}}{N-1} \qquad [5]$$

for the estimated interaction between
persons $i$ and $j$.

Finally, the variance of the interaction
within the pairs is

$$s^2 = \frac{\sum_{i,j} (\hat{b})_{ij}^2}{N(N-1)/2} - \left[\frac{\sum_{i,j} (\hat{b})_{ij}}{N(N-1)/2}\right]^2$$

and since the last term is zero,

$$\frac{N(N-1)}{2}\, s^2 = \sum_{i,j} (\hat{b})_{ij}^2. \qquad [6]$$


TESTING THE SIGNIFICANCE OF THE 
DIFFERENCE BETWEEN MEAN 
INTERACTION EFFECTS 

For the pair $(i, j)$ in a group of
$N(N-1)/2$ pairs, the estimated interaction
effect on the pair-score is
given by Equation 5. Now, since
$\sum (\hat{b})_{ij} = 0$, the mean interaction
effect on a subset of the $N(N-1)/2$
pair-scores will be significantly dif- 
ferent from the mean effect on the 
remaining pair-scores if and only if 
the effect on the first-mentioned sub- 
set is significantly different from 
zero. 

Therefore, the effect of a treatment 
on the pair-scores of a subgroup may 
be tested for its effect on the mean 
interaction level by computing the 
estimated interaction for each score 
in the subgroup and testing by the
t test whether the mean interaction is 
different from zero for that subgroup. 
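
The procedure just described can be sketched in a few lines of code. This is an illustrative sketch only, assuming every one of the N(N−1)/2 pairs was observed; the function names, the dictionary layout, and the four-person example scores are hypothetical. Each interaction score is the pair-score plus the grand mean, minus the mean of all pairs containing each member, as in Equation 5. The last line recomputes t from the sums printed at the foot of Table 2 (34 discussant pairs, sum 26.561, sum of squares 164.728).

```python
import math

def interaction_scores(scores, persons):
    """Estimate the interaction for every pair when all pairs were observed.

    scores  -- dict mapping frozenset({i, j}) to the obtained pair-score y_ij
    persons -- list of person labels
    """
    n = len(persons)
    grand_mean = sum(scores.values()) / (n * (n - 1) / 2)
    # Mean of the N - 1 pair-scores that involve person i.
    person_mean = {
        i: sum(scores[frozenset((i, j))] for j in persons if j != i) / (n - 1)
        for i in persons
    }
    # Equation 5: pair-score + grand mean - mean for each member of the pair.
    return {
        pair: y + grand_mean - sum(person_mean[p] for p in pair)
        for pair, y in scores.items()
    }

def t_from_sums(n, total, total_sq):
    """One-sample t against zero, from n, the sum, and the sum of squares."""
    mean = total / n
    variance = (total_sq - total ** 2 / n) / (n - 1)
    return mean / math.sqrt(variance / n)

# Hypothetical pair-scores for four persons (all six pairs observed).
persons = [1, 2, 3, 4]
raw = {(1, 2): 2, (1, 3): 0, (1, 4): 1, (2, 3): 3, (2, 4): -1, (3, 4): 1}
scores = {frozenset(pair): y for pair, y in raw.items()}
b_hat = interaction_scores(scores, persons)

# Sums reported for the 34 discussant pairs in Table 2.
t_table2 = t_from_sums(34, 26.561, 164.728)
```

The interaction scores over the complete set of pairs sum to zero, which is why a subset mean can be tested against zero rather than against the mean of the remaining pairs.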

The data shown in Table 1 were 
collected during a methodological 
study conducted in the summer of 
1953. Thirteen undergraduates at 
the University of Michigan were 
assigned randomly to pairs, and some 
pairs were given issues to discuss con- 
cerning the previous November's na- 
tional election. Thirty-four of the 
possible 78 pairs were discussant 
pairs. Before and after each pair's 
discussion, the attitude of each mem- 
ber of the pair was measured in re- 
gard to possible consequences for 
various areas of national policy had 
Stevenson been elected. Pretest and 
posttest were five weeks apart. The
discussion by the several pairs took 
place on different dates. Each pair's 
discussion was about 20 minutes long. 
A measure of attitude agreement was 
computed for each pair. Each figure 
in the body of Table 1 is an index of 
change in attitude agreement within 
a pair from pretest to posttest. 

Each person number at the left of
Table 1 labels a column and a row.
The table is read as follows. The pair
composed of persons 2 and 7 increased
in agreement by an index of
2, persons 8 and 10 decreased in
agreement by an index of 1, and so
forth. The last column at the right
shows the mean change in agreement
for all pairs in which person i was a
member. The mean change for all
pairs, shown to the left of the table,
was 0.987.


TABLE 1

OBSERVED SCORES FOR CHANGE IN AGREEMENT AS TO CONSEQUENCES OF STEVENSON'S
ELECTION AND MEAN SCORES FOR PAIRS CONTAINING PERSON i

[Body of table (obtained pair-scores, persons by persons) not recoverable.
Group mean at left: $\sum_{i,j} y_{ij} / [N(N-1)/2] = 0.987$.
Legible entries of the right-hand column $\sum_j y_{ij}/(N-1)$: 0.250, 0.667, 2.500, 1.333.]


TABLE 2

COMPUTATION OF INTERACTION SCORES IN
RESPECT TO CHANGE IN AGREEMENT ON THE
PART OF DISCUSSANT PAIRS, FOR TESTING
DIFFERENCE OF DISCUSSANT MEAN
VERSUS NONDISCUSSANT MEAN

[Column heads: Pair i, j; $y_{ij}$; $\sum_j y_{ij}/(N-1)$; $\sum_i y_{ij}/(N-1)$; Interaction Score $(\hat{b})_{ij}$.
Most rows not recoverable. Legible interaction scores: -.180, -.346, 2.904, 1.237, 5.987, -1.013.]

$\sum (\hat{b})_{ij} = 26.561$

$\sum (\hat{b})_{ij}^2 = 164.728$


Table 2 shows the method of com- 
puting the significance of the mean 
interaction effect of one of two treat- 
ments. The discussant pairs are 
listed in the first column of the table. 
Each line of the table represents the 
computation prescribed by Equation 
5 for a particular pair i, j. The mean
for the entire group of pairs, 0.987, is 
added to the obtained score for the 
pair, and from this sum are sub- 
tracted the means for each member of 
the pair. The resulting estimated 
interaction scores represent the 
changes in agreement which are not 
due to a tendency of the group as a 
whole to change in agreement, nor to 
a tendency peculiar to either of the 
individuals, but to the effect of the 
interaction (communication) between 
two particular individuals. These 
scores may next be subjected to the 
t test, and the result for the example 
of Tables 1 and 2 is t = 2.18, which for
33 degrees of freedom and two tails is
significant beyond the .05 level. 
Since this result argues that the mean 
interaction score for the discussant 
pairs is different from zero, and since 
the sum of all the interaction scores 
(both discussant and nondiscussant) 
must be zero, we can conclude that 
the mean interaction score for the 
discussant pairs is greater than the 
mean score for the nondiscussant 
pairs. 

If fewer than all possible pairs 
enter into an experiment, the compu- 
tation proceeds in the same way, but 
with the number of pairs in the de- 
nominators of the terms of Equation 
5 reduced accordingly. 


TESTING THE SIGNIFICANCE OF 
THE VARIANCE RATIO 


Since the variance of a group of 
scores must be computed about the 
group’s own mean, the computation 
of the F test using interaction scores 
proceeds a little differently from the 
test of the difference between means. 
The interaction scores are computed 
separately for the discussant group of 
pairs and for the nondiscussant group 
of pairs. Table 3 shows the observed 
scores for the discussant group with 
the means for each person, and Table 
4 shows the same for the nondiscuss- 
ant group of pairs. Tables 3 and 4 
are read as follows. The cells with 
entries in Table 3 indicate the dis- 
cussant pairs; those in Table 4 indi- 
cate the nondiscussant pairs. (The 
cells with entries in Table 3 are the 
empty cells in Table 4, and conversely.)
The means for person i in
the right-hand column are computed
by dividing the sum of the entries for
person i by the number of entries,
which latter is symbolized in the
column-heading by $n_i$. In the expression
for the group mean at the
left of this table, $n_1$ stands for the
number of discussant pairs (Table 3)
and $n_2$ for the number of nondiscussant
pairs (Table 4).


TABLE 3

OBSERVED SCORES FOR CHANGE IN AGREEMENT AS TO CONSEQUENCES OF STEVENSON'S
ELECTION, AND MEAN SCORES FOR PAIRS CONTAINING PERSON i,
FOR DISCUSSANT PAIRS ONLY

[Body of table (obtained pair-scores, persons by persons) not recoverable.
Legible entries, apparently from the mean column: 1.713, 3.500, 1.667, 2.333, 0.250, 1.400, 1.856.]


Tables 5 and 6 show the computation
of interaction scores. Each row
in these tables is again a solution of
Equation 5, except that the mean indicated
by the second term on the
right-hand side of Equation 5 is now
the mean of the particular group of
pairs being examined.

For each group of pairs we can
adapt Equation 6. The sum of
squares for the first group will be

$$n_1 s_1^2 = \sum_{i,j} (\hat{b}_1)_{ij}^2$$

with $n_1 - 1$ degrees of freedom, and
similarly for the second group with
$n_2 - 1$ degrees of freedom. For our
example, it turns out that the second
group has the larger mean square,
and we have

$$F = \frac{\sum_{i,j} (\hat{b}_2)_{ij}^2 / (n_2 - 1)}{\sum_{i,j} (\hat{b}_1)_{ij}^2 / (n_1 - 1)}.$$

This result, nonsignificant, indi- 
cates that the significance of the dif- 
ference in mean pair-agreement be- 
tween the two groups resides in the 
relative level of interaction effect, 
and not in the variability of the inter- 
action scores. 
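
The group-wise computation can be sketched as follows. This is a minimal sketch, not the authors' worked example: the two small sets of pair-scores are hypothetical, and the function names are invented for illustration. Within each group the interaction score uses that group's own mean and each person's mean over the pairs in that group only, and the larger mean square is placed in the numerator of F, with n − 1 degrees of freedom per group as stated in the text.

```python
def group_interaction_scores(scores):
    """Interaction scores within one group of pairs (not all pairs needed).

    scores -- dict mapping (i, j) tuples to the obtained pair-score y_ij
    """
    group_mean = sum(scores.values()) / len(scores)
    totals, counts = {}, {}
    for (i, j), y in scores.items():
        for p in (i, j):
            totals[p] = totals.get(p, 0.0) + y
            counts[p] = counts.get(p, 0) + 1
    # Mean for each person over the n_i pairs in this group containing them.
    person_mean = {p: totals[p] / counts[p] for p in totals}
    return {
        (i, j): y + group_mean - person_mean[i] - person_mean[j]
        for (i, j), y in scores.items()
    }

def variance_ratio(b_first, b_second):
    """F with the larger mean square on top; returns (F, df_num, df_den)."""
    groups = []
    for b in (b_first, b_second):
        df = len(b) - 1
        groups.append((sum(v * v for v in b.values()) / df, df))
    (ms_large, df_large), (ms_small, df_small) = sorted(groups, reverse=True)
    return ms_large / ms_small, df_large, df_small

# Hypothetical discussant and nondiscussant pair-scores.
discussant = {(1, 2): 2, (1, 3): 0, (2, 3): 3, (3, 4): 1}
nondiscussant = {(1, 4): 1, (2, 4): -1, (1, 5): 2}
b1 = group_interaction_scores(discussant)
b2 = group_interaction_scores(nondiscussant)
f_ratio, df_num, df_den = variance_ratio(b1, b2)
```

Because each group's interaction scores sum to zero about the group's own mean, the sum of squared scores serves directly as the sum of squares about the mean.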





INTERACTION EFFECTS AMONG OVERLAPPING PAIRS 157 


TABLE 4

OBSERVED SCORES FOR CHANGE IN AGREEMENT AS TO CONSEQUENCES OF STEVENSON'S
ELECTION, AND MEAN SCORES FOR PAIRS CONTAINING PERSON i,
FOR NONDISCUSSANT PAIRS ONLY

[Body of table (obtained pair-scores, persons by persons) not recoverable.
Group mean at left: $\sum_{i,j} y_{ij} / n_2 = 0.386$.
Legible fragments of the right-hand column $\sum_j y_{ij}/n_i$: .700, .000, .375.]


TABLE 5

COMPUTATION OF INTERACTION SCORES IN
RESPECT TO CHANGE IN AGREEMENT ON THE
PART OF DISCUSSANT PAIRS, FOR TESTING
THE VARIANCE RATIO

[Column heads: Pair i, j; $y_{ij}$; $\sum_j y_{ij}/n_i$; $\sum_i y_{ij}/n_j$; Interaction Score $(\hat{b}_1)_{ij}$.
Rows not recoverable. Legible totals: $\sum (\hat{b}_1)_{ij} = 0$; $\sum (\hat{b}_1)_{ij}^2$ legible only as 125.]


TABLE 6

COMPUTATION OF INTERACTION SCORES IN
RESPECT TO CHANGE IN AGREEMENT ON THE
PART OF NONDISCUSSANT PAIRS, FOR
TESTING THE VARIANCE RATIO

[Column heads: Pair i, j; $y_{ij}$; $\sum_j y_{ij}/n_i$; $\sum_i y_{ij}/n_j$; Interaction Score $(\hat{b}_2)_{ij}$.
Rows not recoverable. Legible totals: $\sum (\hat{b}_2)_{ij} = 0$; $\sum (\hat{b}_2)_{ij}^2$ legible only as 183.]

SUMMARY


A method of computing the interaction
effects on variables measured
by observing interacting pairs of persons
has been presented. The estimate
of the interaction effect utilizes
the linear hypothesis. This technique
was developed for use in situations
where one person may be a member
of more than one of the pairs being 
studied. It can be used wherever 
measurements associated with pairs 
are taken and where the experiment- 
er's interest is in the effects of inter- 
action within the pair. Only some of 
all possible pairs of the subjects need 
be observed. 
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“KENDALL'S TAU” 
In their recent review of the literature
concerning Kendall's tau, Schaeffer
and Levitt (3) remark that "generally
applicable tests of the significance
of any partial tau are not yet
available." While this assertion is
true of the partial rank correlation 
coefficient originally described by 
Kendall (2), a partial tau, applicable 
in situations where the variable 
whose effects are to be removed is 
nominal (4, p. 25), has been de- 
scribed and a test of significance ap- 
propriate to the hypothesis of zero 
partial correlation has been de- 
veloped. The statistic in question 
was first described by the writer (1) 
in 1954. In the same paper the mean 
and the variance of the sampling dis- 
tribution were obtained; and the limit 
distribution was proved to be normal. 
In 1955, and independently, Torger- 
son (5) described the same statistic. 
In addition to the results already 
described Torgerson offered a correc- 
tion for continuity and a discussion of 
tied ranks. Very recently, Torgerson 
republished his original interoffice 
draft in Psychometrika (6). The pur- 
pose of this note is to append a brief 
description of the statistic in question 
to Schaeffer and Levitt’s review. 

Suppose we wish to correlate two 
variables, X and Y, partialling out 
the effects of a nominal variable Z. 
For purposes of illustration let Z be
geographical region. The data might
appear as below


¹ Opinions or conclusions contained in this
note are those of the author. They are not to
be construed as necessarily reflecting the view
or the endorsement of the Navy Department.



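        N    S    E    W

[Raw X scores by region; most entries are not recoverable. Legible entries: 26, 29, 32, 71, 45, 63.]

where the numbers are the raw scores
on X arranged (from top to bottom)
for increasing values on the paired Y
score. Ranking the scores in each
column we have

        N    S    E    W

[Ranks not recoverable.]

$V_1 = 3 \qquad V_2 = 5 \qquad V_3 = 0 \qquad V_4 = 3.$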


Let $V_i$ count the number of times a
higher rank on X precedes (comes
higher in the ith column than) a
lower rank. Letting

$$V = \sum_{i=1}^{k} V_i,$$

V varies between

$$M = \frac{\sum_{i=1}^{k} (n_i^2 - n_i)}{2} \qquad [3]$$

and 0, where $n_i$ represents the number
of cases in the ith column. The
partial rank correlation coefficient is
then defined as

$$\frac{2V - M}{M}.$$
M 


In this description we will consider 
only the case of no ties on either X or 
Y. For a discussion of tied ranks, 
see Torgerson (6, p. 151). 

Our concern is with the sampling
distribution of V under the hypothesis
that any one of the $n_1! \cdots n_k!$
possible patterns in the k columns is
as likely as any other. The distribution
has a mean equal to M/2 and a
standard deviation

$$\sigma_V = \left[\frac{\sum_{i=1}^{k} n_i (n_i - 1)(2 n_i + 5)}{72}\right]^{1/2}.$$

The distribution tends rapidly to
normality, even with few columns
and few cases within the columns (6,
p. 147). Therefore, to test for significance
we need only calculate

$$\frac{V - M/2}{\sigma_V} \qquad [6]$$

and refer the result to tables of the 
normal curve. 
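
The computation described in this note can be sketched in code. This is an illustrative sketch under the note's assumption of no ties; the function names and the three example columns of ranks are hypothetical and are not the data of the illustration above. Each column holds the X ranks listed in order of increasing Y within one category of Z.

```python
import math

def inversions(ranks):
    """Count the times a higher rank precedes a lower rank in one column."""
    return sum(
        1
        for a in range(len(ranks))
        for b in range(a + 1, len(ranks))
        if ranks[a] > ranks[b]
    )

def partial_tau(columns):
    """Return (V, M, coefficient, z) for columns of untied X ranks."""
    sizes = [len(col) for col in columns]
    v = sum(inversions(col) for col in columns)
    m = sum(n * (n - 1) // 2 for n in sizes)       # largest possible V
    coefficient = (2 * v - m) / m                  # as defined in the text
    sigma_v = math.sqrt(sum(n * (n - 1) * (2 * n + 5) for n in sizes) / 72)
    z = (v - m / 2) / sigma_v                      # refer to the normal curve
    return v, m, coefficient, z

# Hypothetical ranks for three categories of the nominal variable Z.
columns = [[2, 1, 3], [3, 2, 1], [1, 2, 3, 4]]
v, m, coefficient, z = partial_tau(columns)
```

For these hypothetical columns the three inversion counts are 1, 3, and 0, and the standard deviation of V works out to exactly 2.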